The main aim of this project is to tell a story about crime. Many types of crime occur at various times and places, and this project provides insights into crime under different conditions, such as the time of day, location, weather and season of the year. It also aims to find out whether we can predict occurrences of crime from previously available data, and with what accuracy.
Can we predict if a violent crime may occur based on past data at a given time and place?
Can we find a trend in the occurrence of crime during a certain month? Can we look for a trend in crime occurrence across the months? Do different seasons or months affect the category or type of crime taking place?
Can we observe any changes in the occurrence of crime over the course of a day? Is the crime rate higher during the daytime than at night?
Hypothesis: We expect to see an increase in the crime rate at night because of low visibility and fewer pedestrians on the street. In addition, low lighting at night leads to poor-quality surveillance, which can encourage crime.
What is the density of occurrences of violent crime in Baltimore? Are there clusters of crime occurrences from which we can draw inferences or trends?
Hypothesis: We expect to see an increased crime rate in the business districts because of the amount of footfall that the area experiences at all times of the day.
Can temperature affect violent crime occurrences? If yes, then how?
Hypothesis: During winter, the temperature outside is usually lower, so people mostly prefer to stay indoors. Based on the assumption that violent crimes happen when fewer people are around, we expect to see an increase in the violent crime rate on days with lower temperatures.
The answers to the above questions are important because they will give the police department more information about how often crimes occur at various times of the day and about the locations where they occur the most. This will allow the department to allocate resources accordingly to help reduce crime in one of the most crime-ridden cities in the United States.
We obtained the crime dataset from the Baltimore Police Department website (https://www.baltimorepolice.org/crime-stats), and we used a Python API provided by https://www.worldweatheronline.com/ to retrieve weather data from 2013 onwards.
Why Data Analysis?
The project should be graded more heavily on data analysis than on data processing, because the questions we have posed require us to go above and beyond normal data analysis. In this project, we used K-means clustering to group our data and obtain the cluster centers. We also extracted various features for use in our model and built a Random Forest Regressor to predict the occurrence of violent crimes on the basis of historical data. This is why our work goes beyond basic data analysis.
| | X | Y | RowID | CrimeDateTime | CrimeCode | Location | Description | Inside_Outside | Weapon | Post | District | Neighborhood | Latitude | Longitude | GeoLocation | Premise | VRIName | Total_Incidents | Shape |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1421661.420 | 593584.4920 | 1 | 2021/09/24 08:00:00+00 | 6D | 500 SAINT PAUL ST APT 118 | LARCENY FROM AUTO | NaN | NaN | 124 | CENTRAL | MOUNT VERNON | 39.2959 | -76.6137 | (39.2959,-76.6137) | NaN | NaN | 1 | NaN |
| 1 | 1428629.529 | 592267.2422 | 2 | 2021/09/23 02:00:00+00 | 6D | 0 N WASHINGTON ST | LARCENY FROM AUTO | NaN | NaN | 212 | SOUTHEAST | BUTCHER'S HILL | 39.2922 | -76.5891 | (39.2922,-76.5891) | NaN | NaN | 1 | NaN |
| 2 | 1429981.578 | 593693.8871 | 3 | 2021/09/23 09:00:00+00 | 6J | 400 N BRADFORD ST | LARCENY | NaN | NaN | 221 | SOUTHEAST | MCELDERRY PARK | 39.2961 | -76.5843 | (39.2961,-76.5843) | NaN | NaN | 1 | NaN |
| 3 | 1433589.463 | 590796.6733 | 4 | 2021/09/23 18:27:00+00 | 6J | 300 S EAST AVE | LARCENY | NaN | NaN | 225 | SOUTHEAST | HIGHLANDTOWN | 39.2881 | -76.5716 | (39.2881,-76.5716) | NaN | NaN | 1 | NaN |
| 4 | 1421304.259 | 591033.3302 | 5 | 2021/09/23 23:00:00+00 | 6D | 0 S CHARLES ST | LARCENY FROM AUTO | NaN | NaN | 114 | CENTRAL | DOWNTOWN | 39.2889 | -76.6150 | (39.2889,-76.615) | NaN | NaN | 1 | NaN |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350294 entries, 0 to 350293
Data columns (total 19 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | X | 350087 non-null | float64 |
| 1 | Y | 350087 non-null | float64 |
| 2 | RowID | 350294 non-null | int64 |
| 3 | CrimeDateTime | 350270 non-null | object |
| 4 | CrimeCode | 350294 non-null | object |
| 5 | Location | 348499 non-null | object |
| 6 | Description | 350294 non-null | object |
| 7 | Inside_Outside | 302508 non-null | object |
| 8 | Weapon | 76056 non-null | object |
| 9 | Post | 349552 non-null | object |
| 10 | District | 349552 non-null | object |
| 11 | Neighborhood | 349530 non-null | object |
| 12 | Latitude | 350087 non-null | float64 |
| 13 | Longitude | 350087 non-null | float64 |
| 14 | GeoLocation | 350294 non-null | object |
| 15 | Premise | 302370 non-null | object |
| 16 | VRIName | 41597 non-null | object |
| 17 | Total_Incidents | 350294 non-null | int64 |
| 18 | Shape | 0 non-null | float64 |

dtypes: float64(5), int64(2), object(12)
memory usage: 50.8+ MB
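| Column | Null count |
|---|---|
| X | 207 |
| Y | 207 |
| RowID | 0 |
| CrimeDateTime | 24 |
| CrimeCode | 0 |
| Location | 1795 |
| Description | 0 |
| Inside_Outside | 47786 |
| Weapon | 274238 |
| Post | 742 |
| District | 742 |
| Neighborhood | 764 |
| Latitude | 207 |
| Longitude | 207 |
| GeoLocation | 0 |
| Premise | 47924 |
| VRIName | 308697 |
| Total_Incidents | 0 |
| Shape | 350294 |

dtype: int64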
While examining our columns and their values, we concluded that certain columns are redundant and not needed for our analysis. We drop these columns and specify inplace=True so that the DataFrame is modified in place instead of a new DataFrame being returned.
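The drop step can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's original cell; the list of redundant columns is an assumption inferred from the columns that no longer appear in the next preview (X, Y, VRIName, Total_Incidents, Shape).

```python
import pandas as pd

# Toy frame with a few of the columns from the raw dataset.
df = pd.DataFrame({
    "X": [1421661.42], "Y": [593584.492], "RowID": [1],
    "Description": ["LARCENY FROM AUTO"],
    "VRIName": [None], "Total_Incidents": [1], "Shape": [None],
})

# Assumption: these are the columns treated as redundant.
redundant = ["X", "Y", "VRIName", "Total_Incidents", "Shape"]

# With inplace=True, drop() mutates df and returns None.
result = df.drop(columns=redundant, inplace=True)

print(result)            # None
print(list(df.columns))  # remaining columns
```

Because the call returns None, assigning its result back to df would discard the data; either use inplace=True on its own or reassign the result of a call without it.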
| | RowID | CrimeDateTime | CrimeCode | Location | Description | Inside_Outside | Weapon | Post | District | Neighborhood | Latitude | Longitude | GeoLocation | Premise |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2021/09/24 08:00:00+00 | 6D | 500 SAINT PAUL ST APT 118 | LARCENY FROM AUTO | NaN | NaN | 124 | CENTRAL | MOUNT VERNON | 39.2959 | -76.6137 | (39.2959,-76.6137) | NaN |
| 1 | 2 | 2021/09/23 02:00:00+00 | 6D | 0 N WASHINGTON ST | LARCENY FROM AUTO | NaN | NaN | 212 | SOUTHEAST | BUTCHER'S HILL | 39.2922 | -76.5891 | (39.2922,-76.5891) | NaN |
| 2 | 3 | 2021/09/23 09:00:00+00 | 6J | 400 N BRADFORD ST | LARCENY | NaN | NaN | 221 | SOUTHEAST | MCELDERRY PARK | 39.2961 | -76.5843 | (39.2961,-76.5843) | NaN |
| 3 | 4 | 2021/09/23 18:27:00+00 | 6J | 300 S EAST AVE | LARCENY | NaN | NaN | 225 | SOUTHEAST | HIGHLANDTOWN | 39.2881 | -76.5716 | (39.2881,-76.5716) | NaN |
| 4 | 5 | 2021/09/23 23:00:00+00 | 6D | 0 S CHARLES ST | LARCENY FROM AUTO | NaN | NaN | 114 | CENTRAL | DOWNTOWN | 39.2889 | -76.6150 | (39.2889,-76.615) | NaN |
[Figure: plot of correlations between missing values across columns]
We observe a strong correlation between missing values in Premise and Inside_Outside. This suggests that when a row has no information about the premise where the crime occurred, it is difficult to ascertain whether the crime occurred inside or outside.
Similarly, missing values in Post, District and Neighborhood are strongly correlated with each other. This is because if we do not have information about the Post where the crime occurred, it is difficult to ascertain the Neighborhood and District where it occurred.
We also observe a strong correlation between Longitude and Latitude, which means that if Longitude is missing then Latitude is most likely missing as well.
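One way to quantify these relationships without a plot is to correlate the missingness indicators directly. The sketch below uses plain pandas on toy data; it is an assumption that this mirrors what the original plot displayed.

```python
import numpy as np
import pandas as pd

# Toy data: Latitude and Longitude are always missing together.
df = pd.DataFrame({
    "Latitude":  [39.29, np.nan, 39.30, np.nan, 39.31],
    "Longitude": [-76.61, np.nan, -76.58, np.nan, -76.57],
    "Weapon":    [np.nan, "FIREARM", np.nan, np.nan, "KNIFE"],
})

# 1 where a value is missing, 0 otherwise.
missing = df.isnull().astype(int)

# Pearson correlation between the missingness indicators.
nullity_corr = missing.corr()
print(nullity_corr)
```

A value close to 1 between two columns means they tend to be missing in the same rows.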
Now we turn to the first column with missing values and examine it in more detail.
24
We observe that we have 24 missing values in the date-time column. The date and time at which the crime occurred is an extremely important feature in our model for predicting where and when a violent crime occurs. We therefore cannot impute values by forward filling or backward filling, because that may skew the model results. Since there are only 24 rows with missing values, we drop them.
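A minimal sketch of this step on toy data, using the CrimeDateTime column name from the dataset; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "RowID": [1, 2, 3],
    "CrimeDateTime": ["2021/09/24 08:00:00+00", None, "2021/09/23 09:00:00+00"],
})

# Count the missing timestamps before dropping.
n_missing = df["CrimeDateTime"].isnull().sum()

# Drop only the rows whose timestamp is missing; other nulls are untouched.
df = df.dropna(subset=["CrimeDateTime"])

# Confirm that no missing timestamps remain.
remaining = df["CrimeDateTime"].isnull().sum()
print(n_missing, remaining)
```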
0
datetime objects for efficient value extraction for easy manipulation of date and time

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350270 entries, 0 to 350293
Data columns (total 14 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | RowID | 350270 non-null | int64 |
| 1 | CrimeDateTime | 350270 non-null | datetime64[ns, UTC] |
| 2 | CrimeCode | 350270 non-null | object |
| 3 | Location | 348476 non-null | object |
| 4 | Description | 350270 non-null | object |
| 5 | Inside_Outside | 302484 non-null | object |
| 6 | Weapon | 76032 non-null | object |
| 7 | Post | 349528 non-null | object |
| 8 | District | 349528 non-null | object |
| 9 | Neighborhood | 349506 non-null | object |
| 10 | Latitude | 350063 non-null | float64 |
| 11 | Longitude | 350063 non-null | float64 |
| 12 | GeoLocation | 350270 non-null | object |
| 13 | Premise | 302346 non-null | object |

dtypes: datetime64[ns, UTC](1), float64(2), int64(1), object(10)
memory usage: 40.1+ MB
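The conversion to datetime objects, and the timezone-free date_time column that appears in the later previews, might be produced along these lines. This is a sketch on toy timestamps written with a full ISO offset; the exact calls used in the notebook are not shown in the source.

```python
import pandas as pd

# Toy timestamps (assumption: simplified from the raw file's format).
raw = pd.Series(["2021-09-24 08:00:00+00:00", "2021-09-23 18:27:00+00:00"])

# Parse to timezone-aware UTC datetimes.
crime_dt = pd.to_datetime(raw, utc=True)

# Drop the timezone to get naive datetimes for easier manipulation.
date_time = crime_dt.dt.tz_localize(None)

print(crime_dt.dt.tz, date_time.dt.tz)
print(date_time.dt.year.tolist())
```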
| | RowID | CrimeDateTime | CrimeCode | Location | Description | Inside_Outside | Weapon | Post | District | Neighborhood | Latitude | Longitude | GeoLocation | Premise |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2021-09-24 08:00:00+00:00 | 6D | 500 SAINT PAUL ST APT 118 | LARCENY FROM AUTO | NaN | NaN | 124 | CENTRAL | MOUNT VERNON | 39.2959 | -76.6137 | (39.2959,-76.6137) | NaN |
| 1 | 2 | 2021-09-23 02:00:00+00:00 | 6D | 0 N WASHINGTON ST | LARCENY FROM AUTO | NaN | NaN | 212 | SOUTHEAST | BUTCHER'S HILL | 39.2922 | -76.5891 | (39.2922,-76.5891) | NaN |
| 2 | 3 | 2021-09-23 09:00:00+00:00 | 6J | 400 N BRADFORD ST | LARCENY | NaN | NaN | 221 | SOUTHEAST | MCELDERRY PARK | 39.2961 | -76.5843 | (39.2961,-76.5843) | NaN |
| 3 | 4 | 2021-09-23 18:27:00+00:00 | 6J | 300 S EAST AVE | LARCENY | NaN | NaN | 225 | SOUTHEAST | HIGHLANDTOWN | 39.2881 | -76.5716 | (39.2881,-76.5716) | NaN |
| 4 | 5 | 2021-09-23 23:00:00+00:00 | 6D | 0 S CHARLES ST | LARCENY FROM AUTO | NaN | NaN | 114 | CENTRAL | DOWNTOWN | 39.2889 | -76.6150 | (39.2889,-76.615) | NaN |
| | RowID | CrimeDateTime | CrimeCode | Location | Description | Inside_Outside | Weapon | Post | District | Neighborhood | Latitude | Longitude | GeoLocation | Premise | date_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2021-09-24 08:00:00+00:00 | 6D | 500 SAINT PAUL ST APT 118 | LARCENY FROM AUTO | NaN | NaN | 124 | CENTRAL | MOUNT VERNON | 39.2959 | -76.6137 | (39.2959,-76.6137) | NaN | 2021-09-24 08:00:00 |
| 1 | 2 | 2021-09-23 02:00:00+00:00 | 6D | 0 N WASHINGTON ST | LARCENY FROM AUTO | NaN | NaN | 212 | SOUTHEAST | BUTCHER'S HILL | 39.2922 | -76.5891 | (39.2922,-76.5891) | NaN | 2021-09-23 02:00:00 |
| 2 | 3 | 2021-09-23 09:00:00+00:00 | 6J | 400 N BRADFORD ST | LARCENY | NaN | NaN | 221 | SOUTHEAST | MCELDERRY PARK | 39.2961 | -76.5843 | (39.2961,-76.5843) | NaN | 2021-09-23 09:00:00 |
| 3 | 4 | 2021-09-23 18:27:00+00:00 | 6J | 300 S EAST AVE | LARCENY | NaN | NaN | 225 | SOUTHEAST | HIGHLANDTOWN | 39.2881 | -76.5716 | (39.2881,-76.5716) | NaN | 2021-09-23 18:27:00 |
| 4 | 5 | 2021-09-23 23:00:00+00:00 | 6D | 0 S CHARLES ST | LARCENY FROM AUTO | NaN | NaN | 114 | CENTRAL | DOWNTOWN | 39.2889 | -76.6150 | (39.2889,-76.615) | NaN | 2021-09-23 23:00:00 |
To analyze the number of crime incidents reported each year, we extract the year of the crime from the Crime_Date column using the vectorized string method str.extract() with a regular expression that matches the year format.
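As an illustration of str.extract() with a year pattern, the sketch below works on a toy Series of timestamp strings. The Series name and the exact regular expression are assumptions; the notebook's own pattern is not shown.

```python
import pandas as pd

# Toy timestamp strings (hypothetical stand-in for the date column).
stamps = pd.Series(["2021/09/24 08:00:00+00", "1963/10/30 00:00:00+00"])

# Capture the four leading digits; expand=False returns a Series.
year = stamps.str.extract(r"^(\d{4})", expand=False)

# Incidents per year, ordered by year.
counts = year.value_counts().sort_index()
print(counts)
```

The extracted values are strings; if the column already holds datetime objects, the .dt.year accessor gives the same information as integers.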
| | RowID | CrimeDateTime | CrimeCode | Location | Description | Inside_Outside | Weapon | Post | District | Neighborhood | Latitude | Longitude | GeoLocation | Premise | date_time | Crime_Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2021-09-24 08:00:00+00:00 | 6D | 500 SAINT PAUL ST APT 118 | LARCENY FROM AUTO | NaN | NaN | 124 | CENTRAL | MOUNT VERNON | 39.2959 | -76.6137 | (39.2959,-76.6137) | NaN | 2021-09-24 08:00:00 | 2021 |
| 1 | 2 | 2021-09-23 02:00:00+00:00 | 6D | 0 N WASHINGTON ST | LARCENY FROM AUTO | NaN | NaN | 212 | SOUTHEAST | BUTCHER'S HILL | 39.2922 | -76.5891 | (39.2922,-76.5891) | NaN | 2021-09-23 02:00:00 | 2021 |
| 2 | 3 | 2021-09-23 09:00:00+00:00 | 6J | 400 N BRADFORD ST | LARCENY | NaN | NaN | 221 | SOUTHEAST | MCELDERRY PARK | 39.2961 | -76.5843 | (39.2961,-76.5843) | NaN | 2021-09-23 09:00:00 | 2021 |
| 3 | 4 | 2021-09-23 18:27:00+00:00 | 6J | 300 S EAST AVE | LARCENY | NaN | NaN | 225 | SOUTHEAST | HIGHLANDTOWN | 39.2881 | -76.5716 | (39.2881,-76.5716) | NaN | 2021-09-23 18:27:00 | 2021 |
| 4 | 5 | 2021-09-23 23:00:00+00:00 | 6D | 0 S CHARLES ST | LARCENY FROM AUTO | NaN | NaN | 114 | CENTRAL | DOWNTOWN | 39.2889 | -76.6150 | (39.2889,-76.615) | NaN | 2021-09-23 23:00:00 | 2021 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 350289 | 350290 | 1975-06-01 00:00:00+00:00 | 2A | 4400 OLD FREDERICK RD | RAPE | I | OTHER | 822 | SOUTHWEST | UPLANDS | 39.2896 | -76.6913 | (39.2896,-76.6913) | OTHER - INSIDE | 1975-06-01 00:00:00 | 1975 |
| 350290 | 350291 | 1973-07-01 23:00:00+00:00 | 2A | 4000 SPRINGDALE AVE | RAPE | I | OTHER | 621 | NORTHWEST | CENTRAL FOREST PARK | 39.3262 | -76.6872 | (39.3262,-76.6872) | ROW/TOWNHOUSE-OCC | 1973-07-01 23:00:00 | 1973 |
| 350291 | 350292 | 1970-06-15 00:01:00+00:00 | 2A | 2400 ST STEPHENS CT | RAPE | I | OTHER | 731 | WESTERN | MONDAWMIN | 39.3100 | -76.6571 | (39.31,-76.6571) | ROW/TOWNHOUSE-OCC | 1970-06-15 00:01:00 | 1970 |
| 350292 | 350293 | 1969-07-20 21:00:00+00:00 | 2A | 5400 ROLAND AVE | RAPE | NaN | OTHER | 534 | NORTHERN | ROLAND PARK | 39.3589 | -76.6353 | (39.3589,-76.6353) | NaN | 1969-07-20 21:00:00 | 1969 |
| 350293 | 350294 | 1963-10-30 00:00:00+00:00 | 2A | 3100 FERNDALE AVE | RAPE | I | OTHER | 622 | NORTHWEST | HOWARD PARK | 39.3269 | -76.7026 | (39.3269,-76.7026) | ROW/TOWNHOUSE-OCC | 1963-10-30 00:00:00 | 1963 |
350270 rows × 16 columns
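| Crime_Year | Count |
|---|---|
| 1963 | 1 |
| 1969 | 1 |
| 1970 | 1 |
| 1973 | 1 |
| 1975 | 1 |
| 1977 | 1 |
| 1978 | 3 |
| 1979 | 1 |
| 1980 | 1 |
| 1981 | 1 |
| 1982 | 1 |
| 1985 | 1 |
| 1988 | 1 |
| 1993 | 2 |
| 1995 | 2 |
| 1998 | 3 |
| 1999 | 2 |
| 2000 | 3 |
| 2001 | 3 |
| 2003 | 1 |
| 2004 | 2 |
| 2006 | 1 |
| 2007 | 6 |
| 2008 | 6 |
| 2009 | 7 |
| 2010 | 3 |
| 2011 | 13 |
| 2012 | 14 |
| 2013 | 17 |
| 2014 | 44754 |
| 2015 | 48223 |
| 2016 | 48787 |
| 2017 | 52180 |
| 2018 | 48505 |
| 2019 | 46415 |
| 2020 | 36195 |
| 2021 | 25111 |

Name: Crime_Year, dtype: int64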
We split the CrimeDateTime column into Crime_Date and Crime_Time columns so that we can analyze them separately.
We use the vectorized string method str.split(), treating the space between the two values as the delimiter and specifying expand=True so that the function returns a DataFrame.
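A minimal sketch of the split on toy data. Which column is used as the source of the strings is an assumption here (date_time is used for illustration); a datetime-typed column would first need .astype(str).

```python
import pandas as pd

df = pd.DataFrame({"date_time": ["2021-09-24 08:00:00", "2021-09-23 18:27:00"]})

# expand=True returns a DataFrame with one column per piece.
parts = df["date_time"].str.split(" ", expand=True)

df["Crime_Date"] = parts[0]
df["Crime_Time"] = parts[1]
print(df)
```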
| | RowID | CrimeCode | Location | Description | Inside_Outside | Weapon | Post | District | Neighborhood | Latitude | Longitude | GeoLocation | Premise | date_time | Crime_Year | Crime_Date | Crime_Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6D | 500 SAINT PAUL ST APT 118 | LARCENY FROM AUTO | NaN | NaN | 124 | CENTRAL | MOUNT VERNON | 39.2959 | -76.6137 | (39.2959,-76.6137) | NaN | 2021-09-24 08:00:00 | 2021 | 2021-09-24 | 08:00:00 |
| 1 | 2 | 6D | 0 N WASHINGTON ST | LARCENY FROM AUTO | NaN | NaN | 212 | SOUTHEAST | BUTCHER'S HILL | 39.2922 | -76.5891 | (39.2922,-76.5891) | NaN | 2021-09-23 02:00:00 | 2021 | 2021-09-23 | 02:00:00 |
| 2 | 3 | 6J | 400 N BRADFORD ST | LARCENY | NaN | NaN | 221 | SOUTHEAST | MCELDERRY PARK | 39.2961 | -76.5843 | (39.2961,-76.5843) | NaN | 2021-09-23 09:00:00 | 2021 | 2021-09-23 | 09:00:00 |
| 3 | 4 | 6J | 300 S EAST AVE | LARCENY | NaN | NaN | 225 | SOUTHEAST | HIGHLANDTOWN | 39.2881 | -76.5716 | (39.2881,-76.5716) | NaN | 2021-09-23 18:27:00 | 2021 | 2021-09-23 | 18:27:00 |
| 4 | 5 | 6D | 0 S CHARLES ST | LARCENY FROM AUTO | NaN | NaN | 114 | CENTRAL | DOWNTOWN | 39.2889 | -76.6150 | (39.2889,-76.615) | NaN | 2021-09-23 23:00:00 | 2021 | 2021-09-23 | 23:00:00 |
742
742
764
Observation:
| | RowID | CrimeCode | Location | Description | Inside_Outside | Weapon | Post | District | Neighborhood | Latitude | Longitude | GeoLocation | Premise | date_time | Crime_Year | Crime_Date | Crime_Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 469 | 470 | 7A | NaN | AUTO THEFT | Inside | NaN | NaN | NaN | NaN | NaN | NaN | (,) | STREET | 2021-09-17 01:22:43 | 2021 | 2021-09-17 | 01:22:43 |
| 476 | 477 | 7A | NaN | AUTO THEFT | Inside | NaN | NaN | NaN | NaN | NaN | NaN | (,) | STREET | 2021-09-17 01:22:43 | 2021 | 2021-09-17 | 01:22:43 |
| 477 | 478 | 7A | NaN | AUTO THEFT | Inside | NaN | NaN | NaN | NaN | NaN | NaN | (,) | STREET | 2021-09-17 01:22:43 | 2021 | 2021-09-17 | 01:22:43 |
| 1305 | 1306 | 6J | 4200 SPRING AVE | LARCENY | NaN | NaN | NaN | NaN | NaN | 39.2358 | -76.6773 | (39.2358,-76.6773) | NaN | 2021-09-10 15:22:00 | 2021 | 2021-09-10 | 15:22:00 |
| 1876 | 1877 | 4E | 4200 AUDREY AVE | COMMON ASSAULT | NaN | NaN | NaN | NaN | NaN | 39.2291 | -76.6061 | (39.2291,-76.6061) | NaN | 2021-09-05 03:45:00 | 2021 | 2021-09-05 | 03:45:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 347215 | 347216 | 6D | NaN | LARCENY FROM AUTO | O | NaN | NaN | NaN | NaN | 39.2992 | -76.7112 | (39.2992,-76.7112) | PARKING LOT-OUTSIDE | 2014-01-27 10:00:00 | 2014 | 2014-01-27 | 10:00:00 |
| 347637 | 347638 | 3NF | 300 THE FALLSWAY #RM 203 | ROBBERY - STREET | I | FIREARM | NaN | NaN | NaN | NaN | NaN | (,) | HOTEL/MOTEL | 2014-01-23 23:50:00 | 2014 | 2014-01-23 | 23:50:00 |
| 347638 | 347639 | 3NF | 300 THE FALLSWAY #RM 203 | ROBBERY - STREET | I | FIREARM | NaN | NaN | NaN | NaN | NaN | (,) | HOTEL/MOTEL | 2014-01-23 23:50:00 | 2014 | 2014-01-23 | 23:50:00 |
| 347897 | 347898 | 6G | 800 HOLLAND AVE | LARCENY | I | NaN | NaN | NaN | NaN | NaN | NaN | (,) | APT/CONDO - OCCUPIED | 2014-01-20 08:00:00 | 2014 | 2014-01-20 | 08:00:00 |
| 348251 | 348252 | 6G | 5400 SARRIL RD | LARCENY | I | NaN | NaN | NaN | NaN | NaN | NaN | (,) | ROW/TOWNHOUSE-OCC | 2014-01-17 14:00:00 | 2014 | 2014-01-17 | 14:00:00 |
742 rows × 17 columns
We observe that Neighborhood and District also have many null values in rows where Post is null. We verify this by counting them.
The Neighborhood column has 735 null values for rows where Post is also null.
The District column has 742 null values for rows where Post is also null.
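Counts like these can be obtained by filtering to the rows where Post is null and counting nulls in the other two columns. The sketch below uses toy data and illustrative variable names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Post": [124, np.nan, np.nan, 221],
    "District": ["CENTRAL", np.nan, np.nan, "SOUTHEAST"],
    "Neighborhood": ["MOUNT VERNON", np.nan, "UPLANDS", np.nan],
})

# Rows where Post is missing.
post_null = df[df["Post"].isnull()]

# Null counts of the related columns within those rows only.
co_null = post_null[["District", "Neighborhood"]].isnull().sum()
print(co_null)
```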
We observe that Post, District and Neighborhood are null in the same rows, so we cannot impute values for the Post column using Neighborhood or District. We therefore drop the rows where Post is null and move on to the other columns.
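| Column | Null count |
|---|---|
| RowID | 0 |
| CrimeCode | 0 |
| Location | 1627 |
| Description | 0 |
| Inside_Outside | 47679 |
| Weapon | 273668 |
| Post | 0 |
| District | 0 |
| Neighborhood | 29 |
| Latitude | 0 |
| Longitude | 0 |
| GeoLocation | 0 |
| Premise | 47817 |
| date_time | 0 |
| Crime_Year | 0 |
| Crime_Date | 0 |
| Crime_Time | 0 |

dtype: int64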
Next we move on to the Inside_Outside column. We have the following plan of action for this column:
| Inside_Outside | Count |
|---|---|
| O | 146615 |
| I | 142080 |
| Outside | 10092 |
| Inside | 3062 |

Name: Inside_Outside, dtype: int64
We can observe that this column should have only two unique values instead of four, so we replace I and O with Inside and Outside.
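The normalization, followed by the one-hot encoding that yields the indicator columns seen in the next previews, could look roughly like this. It is a sketch on toy data; the use of pd.get_dummies with dummy_na=True is an assumption about how the indicator columns were created.

```python
import numpy as np
import pandas as pd

s = pd.Series(["I", "O", "Inside", "Outside", np.nan], name="Inside_Outside")

# Collapse the four spellings into two categories; NaN stays NaN.
normalized = s.replace({"I": "Inside", "O": "Outside"})

# One indicator column per category plus one for missing values.
dummies = pd.get_dummies(normalized, dummy_na=True).astype(int)
dummies.columns = ["Inside", "Outside", "Inside_Outside_Null"]
print(dummies)
```

Keeping a separate indicator for missing values lets those rows stay in the dataset instead of being dropped or imputed.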
| | RowID | CrimeCode | Location | Description | Weapon | Post | District | Neighborhood | Latitude | Longitude | GeoLocation | Premise | date_time | Crime_Year | Crime_Date | Crime_Time | col_I | col_O | col_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6D | 500 SAINT PAUL ST APT 118 | LARCENY FROM AUTO | NaN | 124 | CENTRAL | MOUNT VERNON | 39.2959 | -76.6137 | (39.2959,-76.6137) | NaN | 2021-09-24 08:00:00 | 2021 | 2021-09-24 | 08:00:00 | 0 | 0 | 1 |
| 1 | 2 | 6D | 0 N WASHINGTON ST | LARCENY FROM AUTO | NaN | 212 | SOUTHEAST | BUTCHER'S HILL | 39.2922 | -76.5891 | (39.2922,-76.5891) | NaN | 2021-09-23 02:00:00 | 2021 | 2021-09-23 | 02:00:00 | 0 | 0 | 1 |
| 2 | 3 | 6J | 400 N BRADFORD ST | LARCENY | NaN | 221 | SOUTHEAST | MCELDERRY PARK | 39.2961 | -76.5843 | (39.2961,-76.5843) | NaN | 2021-09-23 09:00:00 | 2021 | 2021-09-23 | 09:00:00 | 0 | 0 | 1 |
| 3 | 4 | 6J | 300 S EAST AVE | LARCENY | NaN | 225 | SOUTHEAST | HIGHLANDTOWN | 39.2881 | -76.5716 | (39.2881,-76.5716) | NaN | 2021-09-23 18:27:00 | 2021 | 2021-09-23 | 18:27:00 | 0 | 0 | 1 |
| 4 | 5 | 6D | 0 S CHARLES ST | LARCENY FROM AUTO | NaN | 114 | CENTRAL | DOWNTOWN | 39.2889 | -76.6150 | (39.2889,-76.615) | NaN | 2021-09-23 23:00:00 | 2021 | 2021-09-23 | 23:00:00 | 0 | 0 | 1 |
| | RowID | CrimeCode | Location | Description | Weapon | Post | District | Neighborhood | Latitude | Longitude | GeoLocation | Premise | date_time | Crime_Year | Crime_Date | Crime_Time | Inside | Outside | Inside_Outside_Null |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6D | 500 SAINT PAUL ST APT 118 | LARCENY FROM AUTO | NaN | 124 | CENTRAL | MOUNT VERNON | 39.2959 | -76.6137 | (39.2959,-76.6137) | NaN | 2021-09-24 08:00:00 | 2021 | 2021-09-24 | 08:00:00 | 0 | 0 | 1 |
| 1 | 2 | 6D | 0 N WASHINGTON ST | LARCENY FROM AUTO | NaN | 212 | SOUTHEAST | BUTCHER'S HILL | 39.2922 | -76.5891 | (39.2922,-76.5891) | NaN | 2021-09-23 02:00:00 | 2021 | 2021-09-23 | 02:00:00 | 0 | 0 | 1 |
| 2 | 3 | 6J | 400 N BRADFORD ST | LARCENY | NaN | 221 | SOUTHEAST | MCELDERRY PARK | 39.2961 | -76.5843 | (39.2961,-76.5843) | NaN | 2021-09-23 09:00:00 | 2021 | 2021-09-23 | 09:00:00 | 0 | 0 | 1 |
| 3 | 4 | 6J | 300 S EAST AVE | LARCENY | NaN | 225 | SOUTHEAST | HIGHLANDTOWN | 39.2881 | -76.5716 | (39.2881,-76.5716) | NaN | 2021-09-23 18:27:00 | 2021 | 2021-09-23 | 18:27:00 | 0 | 0 | 1 |
| 4 | 5 | 6D | 0 S CHARLES ST | LARCENY FROM AUTO | NaN | 114 | CENTRAL | DOWNTOWN | 39.2889 | -76.6150 | (39.2889,-76.615) | NaN | 2021-09-23 23:00:00 | 2021 | 2021-09-23 | 23:00:00 | 0 | 0 | 1 |
Now we have three columns: Inside, Outside and Inside_Outside_Null.
47679
| Description | Count |
|---|---|
| LARCENY | 11555 |
| COMMON ASSAULT | 9002 |
| LARCENY FROM AUTO | 8076 |
| AGG. ASSAULT | 6173 |
| BURGLARY | 4711 |
| ROBBERY - STREET | 2828 |
| AUTO THEFT | 2792 |
| ROBBERY - COMMERCIAL | 858 |
| ROBBERY - RESIDENCE | 545 |
| RAPE | 497 |
| ROBBERY - CARJACKING | 483 |
| ARSON | 159 |

Name: Description, dtype: int64

Total incidents of crime: 349528

| Null | In | Out |
|---|---|---|
| 47679 | 145142 | 156707 |

dtype: int64
| | Null | In | Out |
|---|---|---|---|
| AGG. ASSAULT | 6173 | 16446 | 17877 |
| ARSON | 159 | 629 | 654 |
| AUTO THEFT | 2792 | 2523 | 25383 |
| BURGLARY | 4711 | 40369 | 3247 |
| COMMON ASSAULT | 9002 | 33328 | 18007 |
| HOMICIDE | 0 | 378 | 1851 |
| LARCENY | 11555 | 36690 | 28668 |
| LARCENY FROM AUTO | 8076 | 3222 | 33787 |
| RAPE | 497 | 1455 | 534 |
| ROBBERY - CARJACKING | 483 | 158 | 2667 |
| ROBBERY - COMMERCIAL | 858 | 4432 | 838 |
| ROBBERY - RESIDENCE | 545 | 3026 | 180 |
| ROBBERY - STREET | 2828 | 2124 | 18711 |
| SHOOTING | 0 | 362 | 4303 |
The better approach is to keep all the rows with null values in the Inside_Outside column.
In the meantime, we decided to check whether the Premise data can be used to fill the null values in the Inside_Outside column.
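This check amounts to comparing where the two columns are null. The sketch below uses toy data and the original Inside_Outside column for simplicity; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Description": ["LARCENY", "HOMICIDE", "BURGLARY", "SHOOTING"],
    "Inside_Outside": [np.nan, "Outside", "Inside", "Outside"],
    "Premise": [np.nan, np.nan, "ROW/TOWNHOUSE-OCC", "STREET"],
})

io_null = df["Inside_Outside"].isnull()
premise_null = df["Premise"].isnull()

# Rows where Premise is known and could fill a missing Inside_Outside.
fillable = df[io_null & ~premise_null]

# Rows where Premise is missing although Inside_Outside is known.
premise_only = df[premise_null & ~io_null]["Description"].value_counts()

print(len(fillable))
print(premise_only)
```

If the first result is empty, as in the outputs that follow, Premise carries no extra information for the rows where Inside_Outside is missing.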
47817
| Description | Count |
|---|---|
| HOMICIDE | 122 |
| SHOOTING | 16 |

Name: Description, dtype: int64
Series([], Name: Description, dtype: int64)
13.64% of the values in the 'Inside_Outside' column are null.
13.68% of the values in the 'Premise' column are null.
The goal of the project is to predict whether a violent crime will happen at a given time and location. For this, we need to group the crime categories below and label them accordingly.
Categories:
array(['LARCENY FROM AUTO', 'LARCENY', 'HOMICIDE', 'AUTO THEFT',
'COMMON ASSAULT', 'AGG. ASSAULT', 'BURGLARY',
'ROBBERY - COMMERCIAL', 'RAPE', 'ROBBERY - STREET', 'SHOOTING',
'ROBBERY - CARJACKING', 'ARSON', 'ROBBERY - RESIDENCE'],
dtype=object)
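The labeling can be sketched with isin() on the Description column. The grouping below is an assumption for illustration: the previews only confirm that LARCENY FROM AUTO is flagged isAuto and LARCENY is flagged OtherCrime, so the membership of the violent and auto sets is hypothetical.

```python
import pandas as pd

# Assumption: hypothetical grouping of the categories listed above.
VIOLENT = {
    "HOMICIDE", "SHOOTING", "RAPE", "AGG. ASSAULT", "COMMON ASSAULT",
    "ROBBERY - STREET", "ROBBERY - COMMERCIAL",
    "ROBBERY - RESIDENCE", "ROBBERY - CARJACKING",
}
AUTO = {"LARCENY FROM AUTO", "AUTO THEFT"}

df = pd.DataFrame({"Description": ["LARCENY FROM AUTO", "LARCENY", "HOMICIDE"]})

# Binary indicator columns, matching the names used in the dataset.
df["isViolent"] = df["Description"].isin(VIOLENT).astype(int)
df["isAuto"] = df["Description"].isin(AUTO).astype(int)

# Everything that is neither violent nor auto-related.
df["OtherCrime"] = ((df["isViolent"] == 0) & (df["isAuto"] == 0)).astype(int)
print(df)
```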
| | RowID | CrimeCode | Location | Description | Weapon | Post | District | Neighborhood | Latitude | Longitude | ... | date_time | Crime_Year | Crime_Date | Crime_Time | Inside | Outside | Inside_Outside_Null | isViolent | OtherCrime | isAuto |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6D | 500 SAINT PAUL ST APT 118 | LARCENY FROM AUTO | NaN | 124 | CENTRAL | MOUNT VERNON | 39.2959 | -76.6137 | ... | 2021-09-24 08:00:00 | 2021 | 2021-09-24 | 08:00:00 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | 6D | 0 N WASHINGTON ST | LARCENY FROM AUTO | NaN | 212 | SOUTHEAST | BUTCHER'S HILL | 39.2922 | -76.5891 | ... | 2021-09-23 02:00:00 | 2021 | 2021-09-23 | 02:00:00 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 3 | 6J | 400 N BRADFORD ST | LARCENY | NaN | 221 | SOUTHEAST | MCELDERRY PARK | 39.2961 | -76.5843 | ... | 2021-09-23 09:00:00 | 2021 | 2021-09-23 | 09:00:00 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 4 | 6J | 300 S EAST AVE | LARCENY | NaN | 225 | SOUTHEAST | HIGHLANDTOWN | 39.2881 | -76.5716 | ... | 2021-09-23 18:27:00 | 2021 | 2021-09-23 | 18:27:00 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 5 | 6D | 0 S CHARLES ST | LARCENY FROM AUTO | NaN | 114 | CENTRAL | DOWNTOWN | 39.2889 | -76.6150 | ... | 2021-09-23 23:00:00 | 2021 | 2021-09-23 | 23:00:00 | 0 | 0 | 1 | 0 | 0 | 1 |
5 rows × 22 columns
World Weather Online is a website that provides global weather forecasts and a weather API for accessing weather data for any location in a range of formats such as XML, CSV and JSON.
We use the retrieve_hist_data() function from the wwo-hist Python package, which wraps the weather API from World Weather Online.
We have specified the following in the retrieve_hist_data function to obtain historical weather data:
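The notebook's actual argument values are not shown here, so the following is only a sketch of what a call might look like. The location and start date follow the retrieval log; the API key, end date and frequency are placeholders and should be treated as assumptions. The helper names build_weather_request and fetch_weather are hypothetical.

```python
def build_weather_request(api_key):
    """Collect the arguments for retrieve_hist_data in one place."""
    return {
        "api_key": api_key,
        "location_list": ["baltimore"],
        "start_date": "01-JAN-2014",
        "end_date": "31-JAN-2014",   # placeholder, not the notebook's value
        "frequency": 24,             # placeholder: hours between readings
        "location_label": False,
        "export_csv": True,
        "store_df": True,
    }


def fetch_weather(params):
    """Call the wwo-hist package; needs network access and a valid key."""
    # Imported lazily so the rest of the sketch runs without the package.
    from wwo_hist import retrieve_hist_data
    return retrieve_hist_data(**params)


params = build_weather_request("YOUR_API_KEY")
print(params["location_list"], params["start_date"])
```

Calling fetch_weather(params) would issue the requests month by month, which matches the shape of the progress log.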
Retrieving weather data for baltimore
Currently retrieving data for baltimore: from 2014-01-01 to 2014-01-31
Time elapsed (hh:mm:ss.ms) 0:00:00.690864
Currently retrieving data for baltimore: from 2014-02-01 to 2014-02-28
Time elapsed (hh:mm:ss.ms) 0:00:01.360214
...
Currently retrieving data for baltimore: from 2019-09-01 to 2019-09-30
Time elapsed (hh:mm:ss.ms) 0:00:47.936649
Currently retrieving data for baltimore: from 2019-10-01 to 2019-10-31
Time elapsed (hh:mm:ss.ms) 0:00:48.661698
Currently retrieving data for baltimore:
from 2019-11-01 to 2019-11-30 Time elapsed (hh:mm:ss.ms) 0:00:49.338724 Currently retrieving data for baltimore: from 2019-12-01 to 2019-12-31 Time elapsed (hh:mm:ss.ms) 0:00:49.995814 Currently retrieving data for baltimore: from 2020-01-01 to 2020-01-31 Time elapsed (hh:mm:ss.ms) 0:00:50.718250 Currently retrieving data for baltimore: from 2020-02-01 to 2020-02-29 Time elapsed (hh:mm:ss.ms) 0:00:51.431531 Currently retrieving data for baltimore: from 2020-03-01 to 2020-03-31 Time elapsed (hh:mm:ss.ms) 0:00:52.177988 Currently retrieving data for baltimore: from 2020-04-01 to 2020-04-30 Time elapsed (hh:mm:ss.ms) 0:00:52.898791 Currently retrieving data for baltimore: from 2020-05-01 to 2020-05-31 Time elapsed (hh:mm:ss.ms) 0:00:53.587990 Currently retrieving data for baltimore: from 2020-06-01 to 2020-06-30 Time elapsed (hh:mm:ss.ms) 0:00:54.338432 Currently retrieving data for baltimore: from 2020-07-01 to 2020-07-31 Time elapsed (hh:mm:ss.ms) 0:00:55.043529 Currently retrieving data for baltimore: from 2020-08-01 to 2020-08-31 Time elapsed (hh:mm:ss.ms) 0:00:55.779883 Currently retrieving data for baltimore: from 2020-09-01 to 2020-09-30 Time elapsed (hh:mm:ss.ms) 0:00:56.463137 Currently retrieving data for baltimore: from 2020-10-01 to 2020-10-31 Time elapsed (hh:mm:ss.ms) 0:00:57.372235 Currently retrieving data for baltimore: from 2020-11-01 to 2020-11-30 Time elapsed (hh:mm:ss.ms) 0:00:58.084567 Currently retrieving data for baltimore: from 2020-12-01 to 2020-12-31 Time elapsed (hh:mm:ss.ms) 0:00:58.740377 Currently retrieving data for baltimore: from 2021-01-01 to 2021-01-31 Time elapsed (hh:mm:ss.ms) 0:00:59.377895 Currently retrieving data for baltimore: from 2021-02-01 to 2021-02-28 Time elapsed (hh:mm:ss.ms) 0:00:59.989170 Currently retrieving data for baltimore: from 2021-03-01 to 2021-03-31 Time elapsed (hh:mm:ss.ms) 0:01:00.846414 Currently retrieving data for baltimore: from 2021-04-01 to 2021-04-30 Time elapsed (hh:mm:ss.ms) 0:01:01.552004 
Currently retrieving data for baltimore: from 2021-05-01 to 2021-05-31 Time elapsed (hh:mm:ss.ms) 0:01:02.221932 Currently retrieving data for baltimore: from 2021-06-01 to 2021-06-30 Time elapsed (hh:mm:ss.ms) 0:01:02.890184 Currently retrieving data for baltimore: from 2021-07-01 to 2021-07-31 Time elapsed (hh:mm:ss.ms) 0:01:03.658304 Currently retrieving data for baltimore: from 2021-08-01 to 2021-08-31 Time elapsed (hh:mm:ss.ms) 0:01:04.344211 Currently retrieving data for baltimore: from 2021-09-01 to 2021-09-24 Time elapsed (hh:mm:ss.ms) 0:01:04.915060 export baltimore completed!
[ date_time maxtempC mintempC totalSnow_cm sunHour uvIndex \
0 2014-01-01 5 -2 0.0 8.7 2
0 2014-01-02 2 -5 1.2 7.0 1
0 2014-01-03 -9 -12 0.0 8.7 2
0 2014-01-04 -2 -10 0.0 8.7 2
0 2014-01-05 3 -4 0.0 3.5 1
.. ... ... ... ... ... ...
0 2021-09-20 30 18 0.0 12.5 6
0 2021-09-21 25 18 0.0 10.5 5
0 2021-09-22 25 20 0.0 11.8 5
0 2021-09-23 22 15 0.0 12.5 4
0 2021-09-24 23 13 0.0 12.4 5
moon_illumination moonrise moonset sunrise ... WindGustKmph \
0 1 07:15 AM 05:37 PM 07:27 AM ... 15
0 6 08:06 AM 06:49 PM 07:27 AM ... 17
0 13 08:52 AM 08:01 PM 07:27 AM ... 37
0 20 09:32 AM 09:13 PM 07:27 AM ... 20
0 26 10:08 AM 10:21 PM 07:27 AM ... 20
.. ... ... ... ... ... ...
0 92 06:23 PM 05:15 AM 05:53 AM ... 16
0 99 06:47 PM 06:19 AM 05:54 AM ... 16
0 95 07:11 PM 07:21 AM 05:54 AM ... 23
0 88 07:36 PM 08:22 AM 05:55 AM ... 23
0 81 08:02 PM 09:23 AM 05:56 AM ... 16
cloudcover humidity precipMM pressure tempC visibility winddirDegree \
0 22 69 0.0 1027 5 10 193
0 87 91 1.4 1012 2 4 83
0 32 84 0.0 1022 -9 10 287
0 3 74 0.0 1031 -2 10 156
0 78 94 8.4 1020 3 8 143
.. ... ... ... ... ... ... ...
0 2 60 0.0 1025 30 10 105
0 51 74 0.1 1025 25 10 104
0 60 86 2.5 1018 25 10 148
0 36 70 13.8 1014 22 9 291
0 0 56 0.0 1018 23 10 317
windspeedKmph location
0 9 baltimore
0 12 baltimore
0 25 baltimore
0 13 baltimore
0 10 baltimore
.. ... ...
0 12 baltimore
0 11 baltimore
0 16 baltimore
0 17 baltimore
0 11 baltimore
[2824 rows x 25 columns]]
| | date_time | maxtempC | mintempC | totalSnow_cm | sunHour | uvIndex | moon_illumination | moonrise | moonset | sunrise | ... | WindGustKmph | cloudcover | humidity | precipMM | pressure | tempC | visibility | winddirDegree | windspeedKmph | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-01-01 | 5 | -2 | 0.0 | 8.7 | 2 | 1 | 07:15 AM | 05:37 PM | 07:27 AM | ... | 15 | 22 | 69 | 0.0 | 1027 | 5 | 10 | 193 | 9 | baltimore |
| 0 | 2014-01-02 | 2 | -5 | 1.2 | 7.0 | 1 | 6 | 08:06 AM | 06:49 PM | 07:27 AM | ... | 17 | 87 | 91 | 1.4 | 1012 | 2 | 4 | 83 | 12 | baltimore |
| 0 | 2014-01-03 | -9 | -12 | 0.0 | 8.7 | 2 | 13 | 08:52 AM | 08:01 PM | 07:27 AM | ... | 37 | 32 | 84 | 0.0 | 1022 | -9 | 10 | 287 | 25 | baltimore |
| 0 | 2014-01-04 | -2 | -10 | 0.0 | 8.7 | 2 | 20 | 09:32 AM | 09:13 PM | 07:27 AM | ... | 20 | 3 | 74 | 0.0 | 1031 | -2 | 10 | 156 | 13 | baltimore |
| 0 | 2014-01-05 | 3 | -4 | 0.0 | 3.5 | 1 | 26 | 10:08 AM | 10:21 PM | 07:27 AM | ... | 20 | 78 | 94 | 8.4 | 1020 | 3 | 8 | 143 | 10 | baltimore |
5 rows × 25 columns
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2824 entries, 0 to 0
Data columns (total 25 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | date_time | 2824 non-null | datetime64[ns] |
| 1 | maxtempC | 2824 non-null | object |
| 2 | mintempC | 2824 non-null | object |
| 3 | totalSnow_cm | 2824 non-null | object |
| 4 | sunHour | 2824 non-null | object |
| 5 | uvIndex | 2824 non-null | object |
| 6 | moon_illumination | 2824 non-null | object |
| 7 | moonrise | 2824 non-null | object |
| 8 | moonset | 2824 non-null | object |
| 9 | sunrise | 2824 non-null | object |
| 10 | sunset | 2824 non-null | object |
| 11 | DewPointC | 2824 non-null | object |
| 12 | FeelsLikeC | 2824 non-null | object |
| 13 | HeatIndexC | 2824 non-null | object |
| 14 | WindChillC | 2824 non-null | object |
| 15 | WindGustKmph | 2824 non-null | object |
| 16 | cloudcover | 2824 non-null | object |
| 17 | humidity | 2824 non-null | object |
| 18 | precipMM | 2824 non-null | object |
| 19 | pressure | 2824 non-null | object |
| 20 | tempC | 2824 non-null | object |
| 21 | visibility | 2824 non-null | object |
| 22 | winddirDegree | 2824 non-null | object |
| 23 | windspeedKmph | 2824 non-null | object |
| 24 | location | 2824 non-null | object |

dtypes: datetime64[ns](1), object(24)
memory usage: 573.6+ KB
pandas.core.frame.DataFrame
| | date_time | maxtempC | mintempC | cloudcover | precipMM | visibility |
|---|---|---|---|---|---|---|
| 0 | 2014-01-01 | 5 | -2 | 22 | 0.0 | 10 |
| 0 | 2014-01-02 | 2 | -5 | 87 | 1.4 | 4 |
| 0 | 2014-01-03 | -9 | -12 | 32 | 0.0 | 10 |
| 0 | 2014-01-04 | -2 | -10 | 3 | 0.0 | 10 |
| 0 | 2014-01-05 | 3 | -4 | 78 | 8.4 | 8 |
| ... | ... | ... | ... | ... | ... | ... |
| 0 | 2021-09-20 | 30 | 18 | 2 | 0.0 | 10 |
| 0 | 2021-09-21 | 25 | 18 | 51 | 0.1 | 10 |
| 0 | 2021-09-22 | 25 | 20 | 60 | 2.5 | 10 |
| 0 | 2021-09-23 | 22 | 15 | 36 | 13.8 | 9 |
| 0 | 2021-09-24 | 23 | 13 | 0 | 0.0 | 10 |
2824 rows × 6 columns
| | RowID | CrimeCode | Location | Description | Weapon | Post | District | Neighborhood | Latitude | Longitude | ... | Outside | Inside_Outside_Null | isViolent | OtherCrime | isAuto | maxtempC | mintempC | cloudcover | precipMM | visibility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 350294 | 2A | 3100 FERNDALE AVE | RAPE | OTHER | 622 | NORTHWEST | HOWARD PARK | 39.3269 | -76.7026 | ... | 0 | 0 | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 |
| 1 | 350293 | 2A | 5400 ROLAND AVE | RAPE | OTHER | 534 | NORTHERN | ROLAND PARK | 39.3589 | -76.6353 | ... | 0 | 1 | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 |
| 2 | 350292 | 2A | 2400 ST STEPHENS CT | RAPE | OTHER | 731 | WESTERN | MONDAWMIN | 39.3100 | -76.6571 | ... | 0 | 0 | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 |
| 3 | 350291 | 2A | 4000 SPRINGDALE AVE | RAPE | OTHER | 621 | NORTHWEST | CENTRAL FOREST PARK | 39.3262 | -76.6872 | ... | 0 | 0 | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 |
| 4 | 350290 | 2A | 4400 OLD FREDERICK RD | RAPE | OTHER | 822 | SOUTHWEST | UPLANDS | 39.2896 | -76.6913 | ... | 0 | 0 | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 349523 | 3 | 6J | 400 N BRADFORD ST | LARCENY | NaN | 221 | SOUTHEAST | MCELDERRY PARK | 39.2961 | -76.5843 | ... | 0 | 1 | 0 | 1 | 0 | 22 | 15 | 36 | 13.8 | 9 |
| 349524 | 4 | 6J | 300 S EAST AVE | LARCENY | NaN | 225 | SOUTHEAST | HIGHLANDTOWN | 39.2881 | -76.5716 | ... | 0 | 1 | 0 | 1 | 0 | 23 | 13 | 0 | 0.0 | 10 |
| 349525 | 6 | 6D | 2900 KESWICK RD | LARCENY FROM AUTO | NaN | 511 | NORTHERN | HAMPDEN | 39.3226 | -76.6280 | ... | 0 | 1 | 0 | 0 | 1 | 23 | 13 | 0 | 0.0 | 10 |
| 349526 | 5 | 6D | 0 S CHARLES ST | LARCENY FROM AUTO | NaN | 114 | CENTRAL | DOWNTOWN | 39.2889 | -76.6150 | ... | 0 | 1 | 0 | 0 | 1 | 23 | 13 | 0 | 0.0 | 10 |
| 349527 | 1 | 6D | 500 SAINT PAUL ST APT 118 | LARCENY FROM AUTO | NaN | 124 | CENTRAL | MOUNT VERNON | 39.2959 | -76.6137 | ... | 0 | 1 | 0 | 0 | 1 | 23 | 13 | 0 | 0.0 | 10 |
349528 rows × 27 columns
In this section, we want to find out whether crime peaks during a particular month and whether there is a trend in crime occurrence across the months.
We also want to check whether different seasons or months affect the category or type of crime taking place.
We first created three variables holding the number of crimes in each category (violent, automobile, and other crimes), grouped them by month, and used them to plot a stacked bar graph with the plotly.graph_objects module.
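The monthly grouping described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up, the helper names `monthly_counts` and `stacked_bar` are hypothetical, and the indicator columns (`isViolent`, `isAuto`, `OtherCrime`) are taken from the table output shown earlier.

```python
import pandas as pd

# Hypothetical miniature of the crime table; the column names mirror the
# notebook's output (date_time, isViolent, isAuto, OtherCrime).
crimes = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2020-01-05", "2020-01-20", "2020-02-11", "2020-02-12", "2020-07-04",
    ]),
    "isViolent":  [1, 0, 0, 1, 1],
    "isAuto":     [0, 1, 0, 0, 0],
    "OtherCrime": [0, 0, 1, 0, 0],
})


def monthly_counts(df):
    """Sum the three 0/1 indicator columns for each calendar month (1-12)."""
    month = df["date_time"].dt.month.rename("month")
    return df.groupby(month)[["isViolent", "isAuto", "OtherCrime"]].sum()


def stacked_bar(counts):
    """Build a stacked bar figure with one trace per crime category."""
    import plotly.graph_objects as go  # imported lazily; only needed for plotting

    fig = go.Figure(
        data=[go.Bar(name=col, x=counts.index, y=counts[col]) for col in counts.columns]
    )
    fig.update_layout(barmode="stack", xaxis_title="Month", yaxis_title="Number of crimes")
    return fig


counts = monthly_counts(crimes)
print(counts)
```

In a notebook, `stacked_bar(counts).show()` would render the chart. Grouping by calendar month pools all years together, which suits a seasonality question but hides year-to-year change.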
We want to understand whether crime occurrence changes over the course of a day and whether any peak time can be observed. We hypothesized that the crime rate is generally higher at night than during the day.
To see the pattern of crime occurrence within a day, we first group the data frame by hour. As above, we also classify all crime instances into three types based on their description to gain further insight, and display the result in a stacked bar chart.
Type
auto 17.075271
other crimes 54.946317
violent 27.978412
Name: percent, dtype: float64

Type
auto 26.380261
other crimes 36.441219
violent 37.178520
Name: percent, dtype: float64
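The two `percent` outputs above break records down by `Type`. As a rough sketch of how such shares could be computed for two subsets of hours, the example below splits made-up records into night and day. The `hour` column, the helper `type_share`, and the 20:00 to 05:59 definition of night are all assumptions; the notebook's actual subsets are not shown.

```python
import pandas as pd

# Hypothetical sample; "Type" and "hour" stand in for the notebook's columns.
crimes = pd.DataFrame({
    "hour": [1, 3, 9, 14, 15, 22, 23, 12],
    "Type": ["violent", "auto", "other crimes", "other crimes",
             "violent", "violent", "other crimes", "auto"],
})


def type_share(df):
    """Percentage of records belonging to each crime type."""
    share = df["Type"].value_counts(normalize=True) * 100
    return share.sort_index().rename("percent")


# Assumption: "night" means 20:00-05:59; the notebook's cut-off is not shown.
is_night = (crimes["hour"] >= 20) | (crimes["hour"] < 6)
night_share = type_share(crimes[is_night])
day_share = type_share(crimes[~is_night])
print(night_share, day_share, sep="\n\n")
```

Comparing shares rather than raw counts keeps the two subsets comparable even when they cover different numbers of hours.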
What's the density of occurrences of violent crimes in Baltimore? Is there a cluster of crime occurrences from which we can draw inferences or trends?
One of the research questions of our project is to find the density of violent crimes at various locations in Baltimore. For this, we create a new dataframe called isviolent that contains only violent crimes. We then create a heatmap of violent crime occurrences using Folium's HeatMap() function. The heatmap lets us see where the data cluster, which can then be used to derive trends.
| | RowID | CrimeCode | Location | Description | Weapon | Post | District | Neighborhood | Latitude | Longitude | ... | date_time | Crime_Year | Crime_Date | Crime_Time | Inside | Outside | Inside_Outside_Null | isViolent | OtherCrime | isAuto |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 7 | 1A | 3900 BELLE AVE | HOMICIDE | FIREARM | 623 | NORTHWEST | CALLAWAY-GARRISON | 39.3332 | -76.6821 | ... | 2021-09-22 05:23:43 | 2021 | 2021-09-22 | 05:23:43 | 0 | 1 | 0 | 1 | 0 | 0 |
| 9 | 10 | 4E | 5100 ARBUTUS AVE | COMMON ASSAULT | NaN | 613 | NORTHWEST | LANGSTON HUGHES | 39.3436 | -76.6815 | ... | 2021-09-22 02:04:10 | 2021 | 2021-09-22 | 02:04:10 | 0 | 1 | 0 | 1 | 0 | 0 |
| 12 | 13 | 4E | 800 W LAKE AVE | COMMON ASSAULT | NaN | 521 | NORTHERN | NORTH ROLAND PARK/POPLAR HILL | 39.3694 | -76.6342 | ... | 2021-09-22 09:09:37 | 2021 | 2021-09-22 | 09:09:37 | 1 | 0 | 0 | 1 | 0 | 0 |
| 13 | 14 | 4E | 2900 GARRISON BLVD | COMMON ASSAULT | NaN | 624 | NORTHWEST | GARWYN OAKS | 39.3203 | -76.6773 | ... | 2021-09-22 00:37:27 | 2021 | 2021-09-22 | 00:37:27 | 1 | 0 | 0 | 1 | 0 | 0 |
| 14 | 15 | 4C | 1300 W NORTHERN PKWY | AGG. ASSAULT | OTHER | 533 | NORTHERN | SABINA-MATTFELDT | 39.3612 | -76.6480 | ... | 2021-09-22 08:41:19 | 2021 | 2021-09-22 | 08:41:19 | 1 | 0 | 0 | 1 | 0 | 0 |
5 rows × 22 columns
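A minimal sketch of the filtering and heatmap steps is given below, assuming the `Latitude`, `Longitude`, and `isViolent` columns from the table above. The helper names, the map centre, and the zoom level are illustrative choices, not values taken from the notebook.

```python
import pandas as pd

# Hypothetical sample using column names from the table above.
crimes = pd.DataFrame({
    "Latitude":  [39.3332, 39.3436, None, 39.2889],
    "Longitude": [-76.6821, -76.6815, -76.6342, -76.6150],
    "isViolent": [1, 1, 1, 0],
})


def violent_points(df):
    """Return [lat, lon] pairs for violent crimes that have coordinates."""
    isviolent = df[df["isViolent"] == 1].dropna(subset=["Latitude", "Longitude"])
    return isviolent[["Latitude", "Longitude"]].values.tolist()


def violent_heatmap(points, center=(39.29, -76.61), zoom_start=12):
    """Draw the points as a heat layer on a Folium map.

    The centre (roughly downtown Baltimore) and zoom are assumed values.
    """
    import folium
    from folium.plugins import HeatMap

    fmap = folium.Map(location=list(center), zoom_start=zoom_start)
    HeatMap(points).add_to(fmap)
    return fmap


points = violent_points(crimes)
print(points)
```

Calling `violent_heatmap(points)` in a notebook cell displays the map. Rows without coordinates are dropped first because the heat layer needs a numeric latitude and longitude for every point.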
We observe that many violent crimes occur in and around Charles Center, as seen from the warm colours of the heatmap. A plausible explanation is that it lies in the popular downtown business district of Baltimore: many people come to work in this district, and the resulting heavy footfall means more potential victims of violent crime.
| | RowID | CrimeCode | Location | Description | Weapon | Post | District | Neighborhood | Latitude | Longitude | ... | Inside_Outside_Null | isViolent | OtherCrime | isAuto | maxtempC | mintempC | cloudcover | precipMM | visibility | avgtempC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 106 | 350157 | 2B | 2900 FENDALL RD | RAPE | OTHER | 622 | NORTHWEST | HOWARD PARK | 39.3258 | -76.7056 | ... | 1 | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 | 1.5 |
| 108 | 350096 | 4B | 1200 UNION AVE | AGG. ASSAULT | KNIFE | 531 | NORTHERN | HAMPDEN | 39.3332 | -76.6362 | ... | 0 | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 | 1.5 |
| 111 | 350165 | 4B | 1200 UNION AVE | AGG. ASSAULT | KNIFE | 531 | NORTHERN | HAMPDEN | 39.3332 | -76.6362 | ... | 0 | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 | 1.5 |
| 116 | 350072 | 4D | 200 N CHARLES ST | AGG. ASSAULT | HANDS | 114 | CENTRAL | DOWNTOWN | 39.2918 | -76.6152 | ... | 0 | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 | 1.5 |
| 117 | 350182 | 4E | 700 ALICEANNA ST | COMMON ASSAULT | NaN | 211 | SOUTHEAST | INNER HARBOR | 39.2831 | -76.6028 | ... | 0 | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 | 1.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 349503 | 24 | 4E | 2700 BEREA RD | COMMON ASSAULT | NaN | 922 | SOUTHERN | CHERRY HILL | 39.2503 | -76.6332 | ... | 0 | 1 | 0 | 0 | 25 | 20 | 60 | 2.5 | 10 | 22.5 |
| 349504 | 7 | 1A | 3900 BELLE AVE | HOMICIDE | FIREARM | 623 | NORTHWEST | CALLAWAY-GARRISON | 39.3332 | -76.6821 | ... | 0 | 1 | 0 | 0 | 25 | 20 | 60 | 2.5 | 10 | 22.5 |
| 349513 | 22 | 4C | 600 N POTOMAC ST | AGG. ASSAULT | OTHER | 224 | SOUTHEAST | ELLWOOD PARK/MONUMENT | 39.2987 | -76.5751 | ... | 0 | 1 | 0 | 0 | 25 | 20 | 60 | 2.5 | 10 | 22.5 |
| 349514 | 15 | 4C | 1300 W NORTHERN PKWY | AGG. ASSAULT | OTHER | 533 | NORTHERN | SABINA-MATTFELDT | 39.3612 | -76.6480 | ... | 0 | 1 | 0 | 0 | 25 | 20 | 60 | 2.5 | 10 | 22.5 |
| 349519 | 13 | 4E | 800 W LAKE AVE | COMMON ASSAULT | NaN | 521 | NORTHERN | NORTH ROLAND PARK/POPLAR HILL | 39.3694 | -76.6342 | ... | 0 | 1 | 0 | 0 | 25 | 20 | 60 | 2.5 | 10 | 22.5 |
111556 rows × 28 columns
24.5 3176
22.5 3062
26.0 2910
24.0 2774
23.5 2773
...
-11.0 48
-12.0 44
-9.5 43
-14.0 39
-12.5 38
Name: avgtempC, Length: 90, dtype: int64
Because there are 90 distinct average temperatures in the data, we divide them into 15 equal-width bins.
array([-14. , -10.8, -7.6, -4.4, -1.2, 2. , 5.2, 8.4, 11.6,
14.8, 18. , 21.2, 24.4, 27.6, 30.8, 34. ])
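The edges in the array above can be reproduced with `np.linspace`, and `pd.cut` can then assign each average temperature to a bin. The sketch below is an assumption about how the binning might be done, not the notebook's code. The midpoint formula for `avgtempC` is inferred from the table values (max 5 and min -2 give 1.5), and the sample temperatures are made up.

```python
import numpy as np
import pandas as pd

# Daily average temperature as the midpoint of the daily max and min
# (inferred from the table: max 5, min -2 gives 1.5).
weather = pd.DataFrame({"maxtempC": [5, -9, 25, 35], "mintempC": [-2, -19, 20, 33]})
weather["avgtempC"] = (weather["maxtempC"] + weather["mintempC"]) / 2

edges = np.linspace(-14, 34, 16)  # 16 edges -> 15 bins of width 3.2

# Default behaviour: intervals are right-closed, e.g. (-14.0, -10.8],
# so a value equal to the lowest edge (-14.0) gets no bin (NaN).
default_bin = pd.cut(weather["avgtempC"], bins=edges, labels=False)

# include_lowest=True puts values equal to the lowest edge in the first bin.
inclusive_bin = pd.cut(
    weather["avgtempC"], bins=edges, labels=False, include_lowest=True
)

print(default_bin.tolist())
print(inclusive_bin.tolist())
```

The lowest-edge behaviour may matter here: the per-bin counts printed further down add up to 111,517, while the violent-crime frame has 111,556 rows. The gap of 39 equals the number of records at -14.0 in the value counts, which would be consistent with those records falling outside the first bin.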
| | RowID | CrimeCode | Location | Description | Weapon | Post | District | Neighborhood | Latitude | Longitude | ... | isViolent | OtherCrime | isAuto | maxtempC | mintempC | cloudcover | precipMM | visibility | avgtempC | avgtempC_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 106 | 350157 | 2B | 2900 FENDALL RD | RAPE | OTHER | 622 | NORTHWEST | HOWARD PARK | 39.3258 | -76.7056 | ... | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 | 1.5 | (-1.2, 2.0] |
| 108 | 350096 | 4B | 1200 UNION AVE | AGG. ASSAULT | KNIFE | 531 | NORTHERN | HAMPDEN | 39.3332 | -76.6362 | ... | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 | 1.5 | (-1.2, 2.0] |
| 111 | 350165 | 4B | 1200 UNION AVE | AGG. ASSAULT | KNIFE | 531 | NORTHERN | HAMPDEN | 39.3332 | -76.6362 | ... | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 | 1.5 | (-1.2, 2.0] |
| 116 | 350072 | 4D | 200 N CHARLES ST | AGG. ASSAULT | HANDS | 114 | CENTRAL | DOWNTOWN | 39.2918 | -76.6152 | ... | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 | 1.5 | (-1.2, 2.0] |
| 117 | 350182 | 4E | 700 ALICEANNA ST | COMMON ASSAULT | NaN | 211 | SOUTHEAST | INNER HARBOR | 39.2831 | -76.6028 | ... | 1 | 0 | 0 | 5 | -2 | 22 | 0.0 | 10 | 1.5 | (-1.2, 2.0] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 349503 | 24 | 4E | 2700 BEREA RD | COMMON ASSAULT | NaN | 922 | SOUTHERN | CHERRY HILL | 39.2503 | -76.6332 | ... | 1 | 0 | 0 | 25 | 20 | 60 | 2.5 | 10 | 22.5 | (21.2, 24.4] |
| 349504 | 7 | 1A | 3900 BELLE AVE | HOMICIDE | FIREARM | 623 | NORTHWEST | CALLAWAY-GARRISON | 39.3332 | -76.6821 | ... | 1 | 0 | 0 | 25 | 20 | 60 | 2.5 | 10 | 22.5 | (21.2, 24.4] |
| 349513 | 22 | 4C | 600 N POTOMAC ST | AGG. ASSAULT | OTHER | 224 | SOUTHEAST | ELLWOOD PARK/MONUMENT | 39.2987 | -76.5751 | ... | 1 | 0 | 0 | 25 | 20 | 60 | 2.5 | 10 | 22.5 | (21.2, 24.4] |
| 349514 | 15 | 4C | 1300 W NORTHERN PKWY | AGG. ASSAULT | OTHER | 533 | NORTHERN | SABINA-MATTFELDT | 39.3612 | -76.6480 | ... | 1 | 0 | 0 | 25 | 20 | 60 | 2.5 | 10 | 22.5 | (21.2, 24.4] |
| 349519 | 13 | 4E | 800 W LAKE AVE | COMMON ASSAULT | NaN | 521 | NORTHERN | NORTH ROLAND PARK/POPLAR HILL | 39.3694 | -76.6342 | ... | 1 | 0 | 0 | 25 | 20 | 60 | 2.5 | 10 | 22.5 | (21.2, 24.4] |
111556 rows × 29 columns
(-14.0, -10.8] 187
(-10.8, -7.6] 654
(-7.6, -4.4] 1329
(-4.4, -1.2] 3180
(-1.2, 2.0] 6667
(2.0, 5.2] 9648
(5.2, 8.4] 9789
(8.4, 11.6] 11432
(11.6, 14.8] 9214
(14.8, 18.0] 10501
(18.0, 21.2] 10626
(21.2, 24.4] 15055
(24.4, 27.6] 17452
(27.6, 30.8] 5348
(30.8, 34.0] 435
Name: avgtempC_bin, dtype: int64
Because more than one crime can occur on a single day, we standardize the counts by the number of distinct dates that fall in each temperature bin. To do this, we drop the duplicated dates and count how many dates remain in each bin.
(-14.0, -10.8] 8
(-10.8, -7.6] 23
(-7.6, -4.4] 47
(-4.4, -1.2] 102
(-1.2, 2.0] 195
(2.0, 5.2] 264
(5.2, 8.4] 269
(8.4, 11.6] 298
(11.6, 14.8] 230
(14.8, 18.0] 256
(18.0, 21.2] 249
(21.2, 24.4] 351
(24.4, 27.6] 395
(27.6, 30.8] 122
(30.8, 34.0] 11
Name: avgtempC_bin, dtype: int64
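Counting distinct dates per bin, as opposed to crimes per bin, could look roughly like this. The sample rows are invented and the bins are plain strings for brevity; `Crime_Date` and `avgtempC_bin` are the column names seen in the tables above.

```python
import pandas as pd

# Hypothetical sample: three crimes on one cold day, one on another cold day,
# and two on one warm day. The bins are plain strings here for brevity.
violent = pd.DataFrame({
    "Crime_Date": ["2014-01-03", "2014-01-03", "2014-01-03",
                   "2014-01-07", "2021-09-22", "2021-09-22"],
    "avgtempC_bin": ["(-10.8, -7.6]"] * 4 + ["(21.2, 24.4]"] * 2,
})

# Number of crime records in each bin.
crimes_per_bin = violent["avgtempC_bin"].value_counts(sort=False)

# Keep one row per date, then count how many dates fall in each bin.
dates_per_bin = (
    violent.drop_duplicates(subset="Crime_Date")["avgtempC_bin"]
    .value_counts(sort=False)
)

print(crimes_per_bin)
print(dates_per_bin)
```

One caveat of this approach is that a date with no violent crime at all never appears in the crime table, so it is not counted. Counting dates from the weather table instead would include such days.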
We then build a dataframe that shows only the total number of crimes of each type in each bin, which takes several steps.
| | index | RowID | Latitude | Longitude | Crime_Year | Inside | Outside | Inside_Outside_Null | isViolent | OtherCrime | isAuto | avgtempC | isRape | isAggAssault | isCommonAssault | isArson | isShooting | isHomicide | avgtempC_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-14.0, -10.8] | 53433665 | 7350.4428 | -1.432713e+04 | 376863 | 137 | 40 | 10 | 187 | 0 | 0 | -2186.5 | 6 | 57 | 117 | 2 | 2 | 3 | 8 |
| 1 | (-10.8, -7.6] | 168182041 | 25706.2839 | -5.010992e+04 | 1318374 | 374 | 186 | 94 | 654 | 0 | 0 | -5924.0 | 18 | 201 | 399 | 7 | 23 | 6 | 23 |
| 2 | (-7.6, -4.4] | 327826164 | 52238.0021 | -1.018282e+05 | 2679275 | 755 | 406 | 168 | 1329 | 0 | 0 | -7569.0 | 27 | 456 | 772 | 14 | 38 | 22 | 47 |
| 3 | (-4.4, -1.2] | 627325966 | 124996.7583 | -2.436481e+05 | 6414103 | 1849 | 912 | 419 | 3180 | 0 | 0 | -8078.0 | 80 | 1115 | 1797 | 44 | 84 | 60 | 102 |
| 4 | (-1.2, 2.0] | 1166223325 | 262052.9030 | -5.108217e+05 | 13449693 | 3663 | 2150 | 854 | 6667 | 0 | 0 | 4675.0 | 154 | 2414 | 3698 | 75 | 224 | 102 | 195 |
| 5 | (2.0, 5.2] | 1468925661 | 379217.6379 | -7.392382e+05 | 19467319 | 5054 | 3169 | 1425 | 9648 | 0 | 0 | 36025.0 | 206 | 3363 | 5392 | 109 | 369 | 209 | 264 |
| 6 | (5.2, 8.4] | 1608436204 | 384755.9263 | -7.500386e+05 | 19748592 | 5003 | 3391 | 1395 | 9789 | 0 | 0 | 66150.5 | 189 | 3457 | 5476 | 122 | 357 | 188 | 269 |
| 7 | (8.4, 11.6] | 1837230389 | 449335.4623 | -8.759203e+05 | 23064218 | 5710 | 4243 | 1479 | 11432 | 0 | 0 | 116061.0 | 223 | 4081 | 6333 | 146 | 412 | 237 | 298 |
| 8 | (11.6, 14.8] | 1610806936 | 362163.4213 | -7.059711e+05 | 18586941 | 4575 | 3611 | 1028 | 9214 | 0 | 0 | 122698.5 | 219 | 3347 | 4937 | 149 | 381 | 181 | 230 |
| 9 | (14.8, 18.0] | 1777649170 | 412745.3298 | -8.045890e+05 | 21183677 | 5060 | 4059 | 1382 | 10501 | 0 | 0 | 173476.5 | 240 | 3752 | 5709 | 154 | 445 | 201 | 256 |
| 10 | (18.0, 21.2] | 1778022046 | 417658.1775 | -8.141564e+05 | 21436072 | 4673 | 4390 | 1563 | 10626 | 0 | 0 | 210375.0 | 222 | 3815 | 5773 | 145 | 452 | 219 | 249 |
| 11 | (21.2, 24.4] | 2645048717 | 591735.0143 | -1.153509e+06 | 30368381 | 6536 | 6636 | 1883 | 15055 | 0 | 0 | 343650.5 | 317 | 5650 | 7897 | 229 | 688 | 274 | 351 |
| 12 | (24.4, 27.6] | 2476663484 | 685945.5201 | -1.337172e+06 | 35215465 | 6896 | 7532 | 3024 | 17452 | 0 | 0 | 450735.5 | 357 | 6565 | 9051 | 208 | 871 | 400 | 395 |
| 13 | (27.6, 30.8] | 524815203 | 210198.0605 | -4.097672e+05 | 10796469 | 2024 | 2323 | 1001 | 5348 | 0 | 0 | 154708.5 | 117 | 2049 | 2728 | 35 | 300 | 119 | 122 |
| 14 | (30.8, 34.0] | 33459243 | 17096.0640 | -3.333171e+04 | 878361 | 186 | 165 | 84 | 435 | 0 | 0 | 13581.0 | 13 | 166 | 228 | 1 | 19 | 8 | 11 |
| | index | isRape | isAggAssault | isCommonAssault | isArson | isShooting | isHomicide | avgtempC_bin |
|---|---|---|---|---|---|---|---|---|
| 0 | (-14.0, -10.8] | 6 | 57 | 117 | 2 | 2 | 3 | 8 |
| 1 | (-10.8, -7.6] | 18 | 201 | 399 | 7 | 23 | 6 | 23 |
| 2 | (-7.6, -4.4] | 27 | 456 | 772 | 14 | 38 | 22 | 47 |
| 3 | (-4.4, -1.2] | 80 | 1115 | 1797 | 44 | 84 | 60 | 102 |
| 4 | (-1.2, 2.0] | 154 | 2414 | 3698 | 75 | 224 | 102 | 195 |
| 5 | (2.0, 5.2] | 206 | 3363 | 5392 | 109 | 369 | 209 | 264 |
| 6 | (5.2, 8.4] | 189 | 3457 | 5476 | 122 | 357 | 188 | 269 |
| 7 | (8.4, 11.6] | 223 | 4081 | 6333 | 146 | 412 | 237 | 298 |
| 8 | (11.6, 14.8] | 219 | 3347 | 4937 | 149 | 381 | 181 | 230 |
| 9 | (14.8, 18.0] | 240 | 3752 | 5709 | 154 | 445 | 201 | 256 |
| 10 | (18.0, 21.2] | 222 | 3815 | 5773 | 145 | 452 | 219 | 249 |
| 11 | (21.2, 24.4] | 317 | 5650 | 7897 | 229 | 688 | 274 | 351 |
| 12 | (24.4, 27.6] | 357 | 6565 | 9051 | 208 | 871 | 400 | 395 |
| 13 | (27.6, 30.8] | 117 | 2049 | 2728 | 35 | 300 | 119 | 122 |
| 14 | (30.8, 34.0] | 13 | 166 | 228 | 1 | 19 | 8 | 11 |
| | Average_Temperature_Bin | isRape | isAggAssault | isCommonAssault | isArson | isShooting | isHomicide | Date_Counts_Within_The_Temp_Bin |
|---|---|---|---|---|---|---|---|---|
| 0 | (-14.0, -10.8] | 6 | 57 | 117 | 2 | 2 | 3 | 8 |
| 1 | (-10.8, -7.6] | 18 | 201 | 399 | 7 | 23 | 6 | 23 |
| 2 | (-7.6, -4.4] | 27 | 456 | 772 | 14 | 38 | 22 | 47 |
| 3 | (-4.4, -1.2] | 80 | 1115 | 1797 | 44 | 84 | 60 | 102 |
| 4 | (-1.2, 2.0] | 154 | 2414 | 3698 | 75 | 224 | 102 | 195 |
| 5 | (2.0, 5.2] | 206 | 3363 | 5392 | 109 | 369 | 209 | 264 |
| 6 | (5.2, 8.4] | 189 | 3457 | 5476 | 122 | 357 | 188 | 269 |
| 7 | (8.4, 11.6] | 223 | 4081 | 6333 | 146 | 412 | 237 | 298 |
| 8 | (11.6, 14.8] | 219 | 3347 | 4937 | 149 | 381 | 181 | 230 |
| 9 | (14.8, 18.0] | 240 | 3752 | 5709 | 154 | 445 | 201 | 256 |
| 10 | (18.0, 21.2] | 222 | 3815 | 5773 | 145 | 452 | 219 | 249 |
| 11 | (21.2, 24.4] | 317 | 5650 | 7897 | 229 | 688 | 274 | 351 |
| 12 | (24.4, 27.6] | 357 | 6565 | 9051 | 208 | 871 | 400 | 395 |
| 13 | (27.6, 30.8] | 117 | 2049 | 2728 | 35 | 300 | 119 | 122 |
| 14 | (30.8, 34.0] | 13 | 166 | 228 | 1 | 19 | 8 | 11 |
The dataframe above shows the total number of each crime in each temperature bin.
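One way such per-bin totals might be assembled is sketched below: one 0/1 indicator column per crime type, then a grouped sum. The sample rows are made up. The mapping from indicator column to `Description` label is an assumption, and the `ARSON` and `SHOOTING` labels in particular are guessed from the column names because they do not appear in the rows shown.

```python
import pandas as pd

# Hypothetical sample with the Description labels seen in the tables above.
crimes = pd.DataFrame({
    "Description": ["RAPE", "AGG. ASSAULT", "COMMON ASSAULT",
                    "COMMON ASSAULT", "HOMICIDE", "LARCENY"],
    "isViolent": [1, 1, 1, 1, 1, 0],
    "avgtempC_bin": ["(-1.2, 2.0]", "(-1.2, 2.0]", "(-1.2, 2.0]",
                     "(21.2, 24.4]", "(21.2, 24.4]", "(21.2, 24.4]"],
})

# .copy() gives an independent frame, so adding columns to it does not
# trigger pandas' SettingWithCopyWarning.
violent = crimes[crimes["isViolent"] == 1].copy()

# Assumed mapping from indicator column to Description label.
labels = {
    "isRape": "RAPE",
    "isAggAssault": "AGG. ASSAULT",
    "isCommonAssault": "COMMON ASSAULT",
    "isArson": "ARSON",        # guessed label
    "isShooting": "SHOOTING",  # guessed label
    "isHomicide": "HOMICIDE",
}
for col, label in labels.items():
    violent[col] = (violent["Description"] == label).astype(int)

# Sum only the indicator columns, one row per temperature bin.
totals = violent.groupby("avgtempC_bin")[list(labels)].sum().reset_index()
print(totals)
```

Selecting the indicator columns before calling `sum()` avoids summing columns such as `RowID` or `Latitude`, whose totals have no meaning.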
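To visualize temperature sensitivity, the next step is to standardize the data: each crime count is divided by the number of dates in its temperature bin to give a per-day rate, and each rate is then expressed as a percentage of the sum of that crime's rates across all bins.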
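The standardization step can be sketched for a single crime type as follows, using the rape counts and date counts from the first three rows of the totals table above. Because only three of the fifteen bins are included, the percentages printed here differ from the notebook's; the rates themselves match.

```python
import pandas as pd

# First three rows of the totals table above (rape counts and date counts only).
totals = pd.DataFrame({
    "Average_Temperature_Bin": ["(-14.0, -10.8]", "(-10.8, -7.6]", "(-7.6, -4.4]"],
    "isRape": [6, 18, 27],
    "Date_Counts_Within_The_Temp_Bin": [8, 23, 47],
})

# Per-day rate: crimes in the bin divided by the number of dates in the bin.
totals["RapeRate"] = totals["isRape"] / totals["Date_Counts_Within_The_Temp_Bin"]

# Each bin's rate as a percentage of the sum of rates over all bins.
totals["RapeRate%"] = totals["RapeRate"] / totals["RapeRate"].sum() * 100

print(totals)
```

Dividing by the date count matters because the bins are very unevenly populated (8 dates in the coldest bin against 395 in the most common one), so raw counts would mostly reflect how often each temperature occurs.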
| | Average_Temperature_Bin | isRape | isAggAssault | isCommonAssault | isArson | isShooting | isHomicide | Date_Counts_Within_The_Temp_Bin | RapeRate | AggAssaultRate | CommonAssaultRate | ArsonRate | ShootingRate | HomicideRate | RapeRate% | AggAssaultRate% | CommonAssaultRate% | ArsonRate% | ShootingRate% | HomicideRate% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-14.0, -10.8] | 6 | 57 | 117 | 2 | 2 | 3 | 8 | 0.750000 | 7.125000 | 14.625000 | 0.250000 | 0.250000 | 0.375000 | 5.932902 | 3.611352 | 4.835285 | 3.898290 | 1.152089 | 3.588957 |
| 1 | (-10.8, -7.6] | 18 | 201 | 399 | 7 | 23 | 6 | 23 | 0.782609 | 8.739130 | 17.347826 | 0.304348 | 1.000000 | 0.260870 | 6.190855 | 4.429485 | 5.735500 | 4.745745 | 4.608358 | 2.496666 |
| 2 | (-7.6, -4.4] | 27 | 456 | 772 | 14 | 38 | 22 | 47 | 0.574468 | 9.702128 | 16.425532 | 0.297872 | 0.808511 | 0.468085 | 4.544351 | 4.917586 | 5.430573 | 4.644772 | 3.725906 | 4.479833 |
| 3 | (-4.4, -1.2] | 80 | 1115 | 1797 | 44 | 84 | 60 | 102 | 0.784314 | 10.931373 | 17.617647 | 0.431373 | 0.823529 | 0.588235 | 6.204342 | 5.540637 | 5.824707 | 6.726462 | 3.795118 | 5.629736 |
| 4 | (-1.2, 2.0] | 154 | 2414 | 3698 | 75 | 224 | 102 | 195 | 0.789744 | 12.379487 | 18.964103 | 0.384615 | 1.148718 | 0.523077 | 6.247295 | 6.274623 | 6.269870 | 5.997370 | 5.293703 | 5.006135 |
| 5 | (2.0, 5.2] | 206 | 3363 | 5392 | 109 | 369 | 209 | 264 | 0.780303 | 12.738636 | 20.424242 | 0.412879 | 1.397727 | 0.791667 | 6.172616 | 6.456660 | 6.752618 | 6.438086 | 6.441227 | 7.576687 |
| 6 | (5.2, 8.4] | 189 | 3457 | 5476 | 122 | 357 | 188 | 269 | 0.702602 | 12.851301 | 20.356877 | 0.453532 | 1.327138 | 0.698885 | 5.557961 | 6.513765 | 6.730346 | 7.071992 | 6.115925 | 6.688713 |
| 7 | (8.4, 11.6] | 223 | 4081 | 6333 | 146 | 412 | 237 | 298 | 0.748322 | 13.694631 | 21.251678 | 0.489933 | 1.382550 | 0.795302 | 5.919630 | 6.941212 | 7.026183 | 7.639603 | 6.371287 | 7.611479 |
| 8 | (11.6, 14.8] | 219 | 3347 | 4937 | 149 | 381 | 181 | 230 | 0.952174 | 14.552174 | 21.465217 | 0.647826 | 1.656522 | 0.786957 | 7.532206 | 7.375864 | 7.096783 | 10.101657 | 7.633845 | 7.531608 |
| 9 | (14.8, 18.0] | 240 | 3752 | 5709 | 154 | 445 | 201 | 256 | 0.937500 | 14.656250 | 22.300781 | 0.601562 | 1.738281 | 0.785156 | 7.416128 | 7.428615 | 7.373035 | 9.380261 | 8.010622 | 7.514378 |
| 10 | (18.0, 21.2] | 222 | 3815 | 5773 | 145 | 452 | 219 | 249 | 0.891566 | 15.321285 | 23.184739 | 0.582329 | 1.815261 | 0.879518 | 7.052767 | 7.765693 | 7.665287 | 9.080355 | 8.365373 | 8.417473 |
| 11 | (21.2, 24.4] | 317 | 5650 | 7897 | 229 | 688 | 274 | 351 | 0.903134 | 16.096866 | 22.498575 | 0.652422 | 1.960114 | 0.780627 | 7.144274 | 8.158801 | 7.438429 | 10.173316 | 9.032907 | 7.471029 |
| 12 | (24.4, 27.6] | 357 | 6565 | 9051 | 208 | 871 | 400 | 395 | 0.903797 | 16.620253 | 22.913924 | 0.526582 | 2.205063 | 1.012658 | 7.149523 | 8.424083 | 7.575751 | 8.211083 | 10.161721 | 9.691698 |
| 13 | (27.6, 30.8] | 117 | 2049 | 2728 | 35 | 300 | 119 | 122 | 0.959016 | 16.795082 | 22.360656 | 0.286885 | 2.459016 | 0.975410 | 7.586334 | 8.512696 | 7.392830 | 4.473448 | 11.332028 | 9.335210 |
| 14 | (30.8, 34.0] | 13 | 166 | 228 | 1 | 19 | 8 | 11 | 1.181818 | 15.090909 | 20.727273 | 0.090909 | 1.727273 | 0.727273 | 9.348816 | 7.648925 | 6.852805 | 1.417560 | 7.959891 | 6.960401 |
| | Average_Temperature_Bin | RapeRate% | AggAssaultRate% | CommonAssaultRate% | ArsonRate% | ShootingRate% | HomicideRate% |
|---|---|---|---|---|---|---|---|
| 0 | (-14.0, -10.8] | 5.932902 | 3.611352 | 4.835285 | 3.898290 | 1.152089 | 3.588957 |
| 1 | (-10.8, -7.6] | 6.190855 | 4.429485 | 5.735500 | 4.745745 | 4.608358 | 2.496666 |
| 2 | (-7.6, -4.4] | 4.544351 | 4.917586 | 5.430573 | 4.644772 | 3.725906 | 4.479833 |
| 3 | (-4.4, -1.2] | 6.204342 | 5.540637 | 5.824707 | 6.726462 | 3.795118 | 5.629736 |
| 4 | (-1.2, 2.0] | 6.247295 | 6.274623 | 6.269870 | 5.997370 | 5.293703 | 5.006135 |
| 5 | (2.0, 5.2] | 6.172616 | 6.456660 | 6.752618 | 6.438086 | 6.441227 | 7.576687 |
| 6 | (5.2, 8.4] | 5.557961 | 6.513765 | 6.730346 | 7.071992 | 6.115925 | 6.688713 |
| 7 | (8.4, 11.6] | 5.919630 | 6.941212 | 7.026183 | 7.639603 | 6.371287 | 7.611479 |
| 8 | (11.6, 14.8] | 7.532206 | 7.375864 | 7.096783 | 10.101657 | 7.633845 | 7.531608 |
| 9 | (14.8, 18.0] | 7.416128 | 7.428615 | 7.373035 | 9.380261 | 8.010622 | 7.514378 |
| 10 | (18.0, 21.2] | 7.052767 | 7.765693 | 7.665287 | 9.080355 | 8.365373 | 8.417473 |
| 11 | (21.2, 24.4] | 7.144274 | 8.158801 | 7.438429 | 10.173316 | 9.032907 | 7.471029 |
| 12 | (24.4, 27.6] | 7.149523 | 8.424083 | 7.575751 | 8.211083 | 10.161721 | 9.691698 |
| 13 | (27.6, 30.8] | 7.586334 | 8.512696 | 7.392830 | 4.473448 | 11.332028 | 9.335210 |
| 14 | (30.8, 34.0] | 9.348816 | 7.648925 | 6.852805 | 1.417560 | 7.959891 | 6.960401 |
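The rate columns above appear to be built in two steps: a per-day rate for each temperature bin (incidents divided by the number of days in the bin), then each rate expressed as a percentage of the sum of rates across bins. The sketch below illustrates that reading on made-up data; the column names `avg_temp` and `shootings`, the values, and the choice of 4 bins are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical daily data: average temperature (deg C) and shootings per day.
daily = pd.DataFrame({
    "avg_temp": [-12.0, -3.0, 1.0, 4.0, 9.0, 15.0, 20.0, 26.0, 29.0, 33.0],
    "shootings": [0, 1, 1, 2, 1, 3, 2, 4, 3, 1],
})

# Equal-width temperature bins (4 here to keep the toy example small).
daily["temp_bin"] = pd.cut(daily["avg_temp"], bins=4)

# Incidents and number of days observed in each bin.
per_bin = daily.groupby("temp_bin", observed=False).agg(
    shootings=("shootings", "sum"),
    days=("shootings", "size"),
)

# Per-day rate, then the share of each bin in the sum of all rates.
per_bin["rate"] = per_bin["shootings"] / per_bin["days"]
per_bin["rate_pct"] = 100 * per_bin["rate"] / per_bin["rate"].sum()

print(per_bin)
```

Dividing by the number of days in each bin matters because mild temperatures occur far more often than extreme ones, so raw counts alone would mostly reflect how common each temperature is.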
We consider four major violent crimes in our analysis to understand the effect of temperature. Our hypothesis was that violent crime would be more frequent in winter; the analysis shows the opposite, since the crime rate generally increases with temperature.
'Assault' crimes usually require contact between offender and victim. Our analysis suggests that such contact is more likely at temperatures where people are outdoors and less likely otherwise, which would explain why assaults increase during summer.
'Shooting' is more temperature-sensitive than the other categories, with an almost constant positive slope across the temperature range. This suggests that comfortable weather may be associated with more outdoor activity and therefore more opportunities for potential offenders to identify victims.
Based on our analysis, cool to mild weather has more influence on arson than hot weather, and the graph gives some evidence that both extremely hot and extremely cold temperatures are associated with lower arson rates. This might be because ignition is harder in extreme cold, and because there are fewer chances of ignition during the summer holidays when people are indoors.
Our dataset has Latitude and Longitude columns, which we use to group locations into clusters. For this we use the K-means clustering algorithm, which assigns each point to its nearest centroid by Euclidean distance and recomputes each centroid as the mean of the points assigned to it.
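As a rough illustration of the clustering step, here is a minimal NumPy sketch of K-means (Lloyd's algorithm). It is not the notebook's actual code: the function names `kmeans` and `nearest_centroid`, the seed, and `k=3` are assumptions for illustration, and the ten coordinates are the sample rows shown in the output below. It also treats latitude and longitude as plain planar coordinates, which is only an approximation.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct observations picked at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(points, centroids)
        # Move each centroid to the mean of its points (keep it if it has none).
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return nearest_centroid(points, centroids), centroids

def nearest_centroid(points, centroids):
    # Euclidean distance from every point to every centroid, shape (n_points, k).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# (Latitude, Longitude) pairs taken from the sample rows in this section.
coords = np.array([
    [39.2959, -76.6137], [39.2922, -76.5891], [39.2961, -76.5843],
    [39.2881, -76.5716], [39.2889, -76.6150], [39.3226, -76.6280],
    [39.3332, -76.6821], [39.2877, -76.5679], [39.2786, -76.6126],
    [39.3436, -76.6815],
])

labels, centroids = kmeans(coords, k=3)
print(labels)
print(centroids)
```

In practice a library implementation such as scikit-learn's `KMeans` would normally be preferred, since it adds better initialisation and multiple restarts.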
0
| | RowID | Latitude | Longitude |
|---|---|---|---|
| 0 | 1 | 39.2959 | -76.6137 |
| 1 | 2 | 39.2922 | -76.5891 |
| 2 | 3 | 39.2961 | -76.5843 |
| 3 | 4 | 39.2881 | -76.5716 |
| 4 | 5 | 39.2889 | -76.6150 |
127
| | RowID | Latitude | Longitude | cluster_label |
|---|---|---|---|---|
| 0 | 1 | 39.2959 | -76.6137 | 76 |
| 1 | 2 | 39.2922 | -76.5891 | 104 |
| 2 | 3 | 39.2961 | -76.5843 | 22 |
| 3 | 4 | 39.2881 | -76.5716 | 4 |
| 4 | 5 | 39.2889 | -76.6150 | 31 |
| 5 | 6 | 39.3226 | -76.6280 | 85 |
| 6 | 7 | 39.3332 | -76.6821 | 10 |
| 7 | 8 | 39.2877 | -76.5679 | 4 |
| 8 | 9 | 39.2786 | -76.6126 | 42 |
| 9 | 10 | 39.3436 | -76.6815 | 10 |
814 1589 721 1044 835 348 Name: Post, dtype: int64
| | RowID | CrimeCode | Location | Description | Weapon | Post | District | Neighborhood | Latitude_x | Longitude_x | ... | Crime_Time | Inside | Outside | Inside_Outside_Null | isViolent | OtherCrime | isAuto | Latitude_y | Longitude_y | cluster_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 349527 | 350294 | 2A | 3100 FERNDALE AVE | RAPE | OTHER | 622 | NORTHWEST | HOWARD PARK | 39.3269 | -76.7026 | ... | 00:00:00 | 1 | 0 | 0 | 1 | 0 | 0 | 39.3269 | -76.7026 | 51 |
| 349526 | 350293 | 2A | 5400 ROLAND AVE | RAPE | OTHER | 534 | NORTHERN | ROLAND PARK | 39.3589 | -76.6353 | ... | 21:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 39.3589 | -76.6353 | 84 |
| 349525 | 350292 | 2A | 2400 ST STEPHENS CT | RAPE | OTHER | 731 | WESTERN | MONDAWMIN | 39.3100 | -76.6571 | ... | 00:01:00 | 1 | 0 | 0 | 1 | 0 | 0 | 39.3100 | -76.6571 | 60 |
| 349524 | 350291 | 2A | 4000 SPRINGDALE AVE | RAPE | OTHER | 621 | NORTHWEST | CENTRAL FOREST PARK | 39.3262 | -76.6872 | ... | 23:00:00 | 1 | 0 | 0 | 1 | 0 | 0 | 39.3262 | -76.6872 | 110 |
| 349523 | 350290 | 2A | 4400 OLD FREDERICK RD | RAPE | OTHER | 822 | SOUTHWEST | UPLANDS | 39.2896 | -76.6913 | ... | 00:00:00 | 1 | 0 | 0 | 1 | 0 | 0 | 39.2896 | -76.6913 | 7 |
| 349522 | 350289 | 2A | 600 W 34TH ST | RAPE | OTHER | 531 | NORTHERN | WYMAN PARK | 39.3288 | -76.6269 | ... | 00:01:00 | 1 | 0 | 0 | 1 | 0 | 0 | 39.3288 | -76.6269 | 124 |
| 349520 | 350287 | 2A | 4300 PARK HEIGHTS AVE | RAPE | OTHER | 614 | NORTHWEST | CENTRAL PARK HEIGHTS | 39.3389 | -76.6657 | ... | 00:00:00 | 1 | 0 | 0 | 1 | 0 | 0 | 39.3389 | -76.6657 | 33 |
| 349519 | 350286 | 2A | 1900 ARGONNE DR | RAPE | OTHER | 421 | NORTHEAST | MORGAN STATE UNIVERSITY | 39.3405 | -76.5821 | ... | 10:30:00 | 1 | 0 | 0 | 1 | 0 | 0 | 39.3405 | -76.5821 | 79 |
| 349521 | 350288 | 2A | 3600 W BELVEDERE AVE | RAPE | OTHER | 633 | NORTHWEST | ARLINGTON | 39.3467 | -76.6798 | ... | 00:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 39.3467 | -76.6798 | 68 |
| 349518 | 350285 | 2A | 800 BENNINGHAUS RD | RAPE | OTHER | 523 | NORTHERN | BELVEDERE | 39.3600 | -76.6040 | ... | 00:01:00 | 1 | 0 | 0 | 1 | 0 | 0 | 39.3600 | -76.6040 | 106 |
10 rows × 25 columns
<class 'pandas.core.frame.DataFrame'> Int64Index: 349528 entries, 349527 to 0 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 RowID 349528 non-null int64 1 CrimeCode 349528 non-null object 2 Location 347901 non-null object 3 Description 349528 non-null object 4 Weapon 75860 non-null object 5 Post 349528 non-null object 6 District 349528 non-null object 7 Neighborhood 349499 non-null object 8 Latitude_x 349528 non-null float64 9 Longitude_x 349528 non-null float64 10 GeoLocation 349528 non-null object 11 Premise 301711 non-null object 12 date_time 349528 non-null datetime64[ns] 13 Crime_Year 349528 non-null int64 14 Crime_Date 349528 non-null object 15 Crime_Time 349528 non-null object 16 Inside 349528 non-null int64 17 Outside 349528 non-null int64 18 Inside_Outside_Null 349528 non-null int64 19 isViolent 349528 non-null int64 20 OtherCrime 349528 non-null int64 21 isAuto 349528 non-null int64 22 Latitude_y 349528 non-null float64 23 Longitude_y 349528 non-null float64 24 cluster_label 349528 non-null int32 dtypes: datetime64[ns](1), float64(4), int32(1), int64(8), object(11) memory usage: 68.0+ MB
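The prediction model needs several feature columns that have to be computed first. For each cluster we compute rolling sums over windows of 120, 30, 7 and 1 rows, with one row per cluster and date. These appear in the output below as `Sum120`, `Sum30`, `Sum7` and `Sum1`, together with the matching `SumAutoRelated*` and `SumOther*` columns.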
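A minimal sketch of how such rolling features could be computed with pandas is shown below. It is an assumption about the approach, not the notebook's code: the `violent` column name is made up, and the ten rows are the first ten dates of cluster 0 with their `Sum1` values from the output in this section. The windows here are counted in rows (dates with at least one recorded crime), which is what the displayed values are consistent with.

```python
import pandas as pd

# One row per cluster and date; `violent` is the number of violent crimes that day.
daily = pd.DataFrame({
    "cluster_label": [0] * 10,
    "Crime_Date": pd.to_datetime([
        "2014-01-01", "2014-01-02", "2014-01-04", "2014-01-05", "2014-01-06",
        "2014-01-07", "2014-01-08", "2014-01-10", "2014-01-12", "2014-01-13",
    ]),
    "violent": [1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
})

daily = daily.sort_values(["cluster_label", "Crime_Date"])

# Rolling sums within each cluster over the last 120, 30, 7 and 1 rows.
for window in (120, 30, 7, 1):
    daily[f"Sum{window}"] = (
        daily.groupby("cluster_label")["violent"]
        .transform(lambda s, w=window: s.rolling(w, min_periods=1).sum())
    )

print(daily)
```

If calendar-day windows are wanted instead, one option is to set `Crime_Date` as the index and use an offset window such as `rolling("7D")`, which ignores gaps between dates.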
<class 'pandas.core.frame.DataFrame'> MultiIndex: 194537 entries, (0, datetime.date(2014, 1, 1)) to (126, datetime.date(2021, 9, 20)) Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Sum120 194537 non-null float64 dtypes: float64(1) memory usage: 2.2 MB
| cluster_label | Crime_Date | Sum120 |
|---|---|---|
| 0 | 2014-01-01 | 1.0 |
| 0 | 2014-01-02 | 1.0 |
| 0 | 2014-01-04 | 1.0 |
| 0 | 2014-01-05 | 2.0 |
| 0 | 2014-01-06 | 2.0 |
| 0 | 2014-01-07 | 2.0 |
| 0 | 2014-01-08 | 2.0 |
| 0 | 2014-01-10 | 2.0 |
| 0 | 2014-01-12 | 3.0 |
| 0 | 2014-01-13 | 3.0 |
| 0 | 2014-01-15 | 3.0 |
| 0 | 2014-01-16 | 3.0 |
| 0 | 2014-01-17 | 5.0 |
| 0 | 2014-01-18 | 7.0 |
| 0 | 2014-01-19 | 7.0 |
| 0 | 2014-01-20 | 7.0 |
| 0 | 2014-01-22 | 7.0 |
| 0 | 2014-01-23 | 7.0 |
| 0 | 2014-01-24 | 7.0 |
| 0 | 2014-01-25 | 7.0 |
| 0 | 2014-01-26 | 7.0 |
| 0 | 2014-01-27 | 9.0 |
| 0 | 2014-01-28 | 9.0 |
| 0 | 2014-01-30 | 9.0 |
| 0 | 2014-02-01 | 11.0 |
| 0 | 2014-02-02 | 12.0 |
| 0 | 2014-02-03 | 12.0 |
| 0 | 2014-02-04 | 12.0 |
| 0 | 2014-02-05 | 12.0 |
| 0 | 2014-02-06 | 12.0 |
| 0 | 2014-02-07 | 12.0 |
| 0 | 2014-02-08 | 12.0 |
| 0 | 2014-02-09 | 12.0 |
| 0 | 2014-02-10 | 12.0 |
| 0 | 2014-02-11 | 13.0 |
| 0 | 2014-02-12 | 13.0 |
| 0 | 2014-02-13 | 13.0 |
| 0 | 2014-02-14 | 13.0 |
| 0 | 2014-02-15 | 14.0 |
| 0 | 2014-02-17 | 14.0 |
| 0 | 2014-02-18 | 14.0 |
| 0 | 2014-02-19 | 14.0 |
| 0 | 2014-02-20 | 14.0 |
| 0 | 2014-02-21 | 15.0 |
| 0 | 2014-02-22 | 16.0 |
| 0 | 2014-02-23 | 16.0 |
| 0 | 2014-02-26 | 16.0 |
| 0 | 2014-02-28 | 16.0 |
| 0 | 2014-03-01 | 16.0 |
| 0 | 2014-03-02 | 18.0 |
| cluster_label | Crime_Date | Sum120 | Sum30 | Sum7 | Sum1 | SumAutoRelated120 | SumAutoRelated30 | SumAutoRelated7 | SumAutoRelated1 | SumOther120 | SumOther30 | SumOther7 | SumOther1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-01-01 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| 0 | 2014-01-02 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 8.0 | 8.0 | 8.0 | 4.0 |
| 0 | 2014-01-04 | 1.0 | 1.0 | 1.0 | 0.0 | 5.0 | 5.0 | 5.0 | 2.0 | 9.0 | 9.0 | 9.0 | 1.0 |
| 0 | 2014-01-05 | 2.0 | 2.0 | 2.0 | 1.0 | 5.0 | 5.0 | 5.0 | 0.0 | 11.0 | 11.0 | 11.0 | 2.0 |
| 0 | 2014-01-06 | 2.0 | 2.0 | 2.0 | 0.0 | 6.0 | 6.0 | 6.0 | 1.0 | 12.0 | 12.0 | 12.0 | 1.0 |
| 0 | 2014-01-07 | 2.0 | 2.0 | 2.0 | 0.0 | 6.0 | 6.0 | 6.0 | 0.0 | 13.0 | 13.0 | 13.0 | 1.0 |
| 0 | 2014-01-08 | 2.0 | 2.0 | 2.0 | 0.0 | 7.0 | 7.0 | 7.0 | 1.0 | 14.0 | 14.0 | 14.0 | 1.0 |
| 0 | 2014-01-10 | 2.0 | 2.0 | 1.0 | 0.0 | 8.0 | 8.0 | 5.0 | 1.0 | 14.0 | 14.0 | 10.0 | 0.0 |
| 0 | 2014-01-12 | 3.0 | 3.0 | 2.0 | 1.0 | 9.0 | 9.0 | 6.0 | 1.0 | 14.0 | 14.0 | 6.0 | 0.0 |
| 0 | 2014-01-13 | 3.0 | 3.0 | 2.0 | 0.0 | 11.0 | 11.0 | 6.0 | 2.0 | 14.0 | 14.0 | 5.0 | 0.0 |
| 0 | 2014-01-15 | 3.0 | 3.0 | 1.0 | 0.0 | 15.0 | 15.0 | 10.0 | 4.0 | 15.0 | 15.0 | 4.0 | 1.0 |
| 0 | 2014-01-16 | 3.0 | 3.0 | 1.0 | 0.0 | 17.0 | 17.0 | 11.0 | 2.0 | 16.0 | 16.0 | 4.0 | 1.0 |
| 0 | 2014-01-17 | 5.0 | 5.0 | 3.0 | 2.0 | 23.0 | 23.0 | 17.0 | 6.0 | 16.0 | 16.0 | 3.0 | 0.0 |
| 0 | 2014-01-18 | 7.0 | 7.0 | 5.0 | 2.0 | 23.0 | 23.0 | 16.0 | 0.0 | 17.0 | 17.0 | 3.0 | 1.0 |
| 0 | 2014-01-19 | 7.0 | 7.0 | 5.0 | 0.0 | 26.0 | 26.0 | 18.0 | 3.0 | 18.0 | 18.0 | 4.0 | 1.0 |
| 0 | 2014-01-20 | 7.0 | 7.0 | 4.0 | 0.0 | 32.0 | 32.0 | 23.0 | 6.0 | 18.0 | 18.0 | 4.0 | 0.0 |
| 0 | 2014-01-22 | 7.0 | 7.0 | 4.0 | 0.0 | 33.0 | 33.0 | 22.0 | 1.0 | 19.0 | 19.0 | 5.0 | 1.0 |
| 0 | 2014-01-23 | 7.0 | 7.0 | 4.0 | 0.0 | 36.0 | 36.0 | 21.0 | 3.0 | 20.0 | 20.0 | 5.0 | 1.0 |
| 0 | 2014-01-24 | 7.0 | 7.0 | 4.0 | 0.0 | 36.0 | 36.0 | 19.0 | 0.0 | 21.0 | 21.0 | 5.0 | 1.0 |
| 0 | 2014-01-25 | 7.0 | 7.0 | 2.0 | 0.0 | 37.0 | 37.0 | 14.0 | 1.0 | 21.0 | 21.0 | 5.0 | 0.0 |
| 0 | 2014-01-26 | 7.0 | 7.0 | 0.0 | 0.0 | 38.0 | 38.0 | 15.0 | 1.0 | 21.0 | 21.0 | 4.0 | 0.0 |
| 0 | 2014-01-27 | 9.0 | 9.0 | 2.0 | 2.0 | 39.0 | 39.0 | 13.0 | 1.0 | 22.0 | 22.0 | 4.0 | 1.0 |
| 0 | 2014-01-28 | 9.0 | 9.0 | 2.0 | 0.0 | 40.0 | 40.0 | 8.0 | 1.0 | 22.0 | 22.0 | 4.0 | 0.0 |
| 0 | 2014-01-30 | 9.0 | 9.0 | 2.0 | 0.0 | 43.0 | 43.0 | 10.0 | 3.0 | 24.0 | 24.0 | 5.0 | 2.0 |
| 0 | 2014-02-01 | 11.0 | 11.0 | 4.0 | 2.0 | 44.0 | 44.0 | 8.0 | 1.0 | 24.0 | 24.0 | 4.0 | 0.0 |
| 0 | 2014-02-02 | 12.0 | 12.0 | 5.0 | 1.0 | 44.0 | 44.0 | 8.0 | 0.0 | 27.0 | 27.0 | 6.0 | 3.0 |
| 0 | 2014-02-03 | 12.0 | 12.0 | 5.0 | 0.0 | 47.0 | 47.0 | 10.0 | 3.0 | 27.0 | 27.0 | 6.0 | 0.0 |
| 0 | 2014-02-04 | 12.0 | 12.0 | 5.0 | 0.0 | 47.0 | 47.0 | 9.0 | 0.0 | 28.0 | 28.0 | 7.0 | 1.0 |
| 0 | 2014-02-05 | 12.0 | 12.0 | 3.0 | 0.0 | 49.0 | 49.0 | 10.0 | 2.0 | 29.0 | 29.0 | 7.0 | 1.0 |
| 0 | 2014-02-06 | 12.0 | 12.0 | 3.0 | 0.0 | 50.0 | 50.0 | 10.0 | 1.0 | 31.0 | 31.0 | 9.0 | 2.0 |
| 0 | 2014-02-07 | 12.0 | 11.0 | 3.0 | 0.0 | 53.0 | 50.0 | 10.0 | 3.0 | 31.0 | 27.0 | 7.0 | 0.0 |
| 0 | 2014-02-08 | 12.0 | 11.0 | 1.0 | 0.0 | 55.0 | 52.0 | 11.0 | 2.0 | 31.0 | 23.0 | 7.0 | 0.0 |
| 0 | 2014-02-09 | 12.0 | 11.0 | 0.0 | 0.0 | 56.0 | 51.0 | 12.0 | 1.0 | 31.0 | 22.0 | 4.0 | 0.0 |
| 0 | 2014-02-10 | 12.0 | 10.0 | 0.0 | 0.0 | 56.0 | 51.0 | 9.0 | 0.0 | 32.0 | 21.0 | 5.0 | 1.0 |
| 0 | 2014-02-11 | 13.0 | 11.0 | 1.0 | 1.0 | 57.0 | 51.0 | 10.0 | 1.0 | 32.0 | 20.0 | 4.0 | 0.0 |
| 0 | 2014-02-12 | 13.0 | 11.0 | 1.0 | 0.0 | 61.0 | 55.0 | 12.0 | 4.0 | 33.0 | 20.0 | 4.0 | 1.0 |
| 0 | 2014-02-13 | 13.0 | 11.0 | 1.0 | 0.0 | 61.0 | 54.0 | 11.0 | 0.0 | 34.0 | 20.0 | 3.0 | 1.0 |
| 0 | 2014-02-14 | 13.0 | 11.0 | 1.0 | 0.0 | 61.0 | 53.0 | 8.0 | 0.0 | 36.0 | 22.0 | 5.0 | 2.0 |
| 0 | 2014-02-15 | 14.0 | 11.0 | 2.0 | 1.0 | 67.0 | 58.0 | 12.0 | 6.0 | 36.0 | 22.0 | 5.0 | 0.0 |
| 0 | 2014-02-17 | 14.0 | 11.0 | 2.0 | 0.0 | 69.0 | 58.0 | 13.0 | 2.0 | 36.0 | 22.0 | 5.0 | 0.0 |
| 0 | 2014-02-18 | 14.0 | 11.0 | 2.0 | 0.0 | 71.0 | 56.0 | 15.0 | 2.0 | 37.0 | 22.0 | 5.0 | 1.0 |
| 0 | 2014-02-19 | 14.0 | 11.0 | 1.0 | 0.0 | 71.0 | 54.0 | 14.0 | 0.0 | 38.0 | 22.0 | 6.0 | 1.0 |
| 0 | 2014-02-20 | 14.0 | 9.0 | 1.0 | 0.0 | 72.0 | 49.0 | 11.0 | 1.0 | 39.0 | 23.0 | 6.0 | 1.0 |
| 0 | 2014-02-21 | 15.0 | 8.0 | 2.0 | 1.0 | 74.0 | 51.0 | 13.0 | 2.0 | 39.0 | 22.0 | 5.0 | 0.0 |
| 0 | 2014-02-22 | 16.0 | 9.0 | 3.0 | 1.0 | 77.0 | 51.0 | 16.0 | 3.0 | 40.0 | 22.0 | 4.0 | 1.0 |
| 0 | 2014-02-23 | 16.0 | 9.0 | 2.0 | 0.0 | 78.0 | 46.0 | 11.0 | 1.0 | 41.0 | 23.0 | 5.0 | 1.0 |
| 0 | 2014-02-26 | 16.0 | 9.0 | 2.0 | 0.0 | 81.0 | 48.0 | 12.0 | 3.0 | 41.0 | 22.0 | 5.0 | 0.0 |
| 0 | 2014-02-28 | 16.0 | 9.0 | 2.0 | 0.0 | 82.0 | 46.0 | 11.0 | 1.0 | 43.0 | 23.0 | 6.0 | 2.0 |
| 0 | 2014-03-01 | 16.0 | 9.0 | 2.0 | 0.0 | 83.0 | 47.0 | 12.0 | 1.0 | 44.0 | 23.0 | 6.0 | 1.0 |
| 0 | 2014-03-02 | 18.0 | 11.0 | 4.0 | 2.0 | 83.0 | 46.0 | 11.0 | 0.0 | 46.0 | 25.0 | 7.0 | 2.0 |
<class 'pandas.core.frame.DataFrame'> RangeIndex: 194537 entries, 0 to 194536 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cluster_label 194537 non-null int64 1 Crime_Date 194537 non-null object 2 Sum120 194537 non-null float64 3 Sum30 194537 non-null float64 4 Sum7 194537 non-null float64 5 Sum1 194537 non-null float64 6 SumAutoRelated120 194537 non-null float64 7 SumAutoRelated30 194537 non-null float64 8 SumAutoRelated7 194537 non-null float64 9 SumAutoRelated1 194537 non-null float64 10 SumOther120 194537 non-null float64 11 SumOther30 194537 non-null float64 12 SumOther7 194537 non-null float64 13 SumOther1 194537 non-null float64 dtypes: float64(12), int64(1), object(1) memory usage: 20.8+ MB
| | maxtempC | mintempC | cloudcover | precipMM | visibility | Crime_Date |
|---|---|---|---|---|---|---|
| 0 | 5 | -2 | 22 | 0.0 | 10 | 2014-01-01 |
| 0 | 2 | -5 | 87 | 1.4 | 4 | 2014-01-02 |
| 0 | -9 | -12 | 32 | 0.0 | 10 | 2014-01-03 |
| 0 | -2 | -10 | 3 | 0.0 | 10 | 2014-01-04 |
| 0 | 3 | -4 | 78 | 8.4 | 8 | 2014-01-05 |
| ... | ... | ... | ... | ... | ... | ... |
| 0 | 30 | 18 | 2 | 0.0 | 10 | 2021-09-20 |
| 0 | 25 | 18 | 51 | 0.1 | 10 | 2021-09-21 |
| 0 | 25 | 20 | 60 | 2.5 | 10 | 2021-09-22 |
| 0 | 22 | 15 | 36 | 13.8 | 9 | 2021-09-23 |
| 0 | 23 | 13 | 0 | 0.0 | 10 | 2021-09-24 |
2824 rows × 6 columns
| | cluster_label | Crime_Date | Sum120 | Sum30 | Sum7 | Sum1 | SumAutoRelated120 | SumAutoRelated30 | SumAutoRelated7 | SumAutoRelated1 | SumOther120 | SumOther30 | SumOther7 | SumOther1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2014-01-01 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| 1 | 0 | 2014-01-02 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 8.0 | 8.0 | 8.0 | 4.0 |
| 2 | 0 | 2014-01-04 | 1.0 | 1.0 | 1.0 | 0.0 | 5.0 | 5.0 | 5.0 | 2.0 | 9.0 | 9.0 | 9.0 | 1.0 |
| 3 | 0 | 2014-01-05 | 2.0 | 2.0 | 2.0 | 1.0 | 5.0 | 5.0 | 5.0 | 0.0 | 11.0 | 11.0 | 11.0 | 2.0 |
| 4 | 0 | 2014-01-06 | 2.0 | 2.0 | 2.0 | 0.0 | 6.0 | 6.0 | 6.0 | 1.0 | 12.0 | 12.0 | 12.0 | 1.0 |
| | cluster_label | Crime_Date | Sum120 | Sum30 | Sum7 | Sum1 | SumAutoRelated120 | SumAutoRelated30 | SumAutoRelated7 | SumAutoRelated1 | SumOther120 | SumOther30 | SumOther7 | SumOther1 | maxtempC | mintempC | cloudcover | precipMM | visibility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2014-01-01 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 | 5 | -2 | 22 | 0.0 | 10 |
| 1 | 3 | 2014-01-01 | 49.0 | 15.0 | 3.0 | 1.0 | 35.0 | 12.0 | 1.0 | 0.0 | 79.0 | 17.0 | 3.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 2 | 4 | 2014-01-01 | 75.0 | 16.0 | 3.0 | 0.0 | 40.0 | 11.0 | 2.0 | 0.0 | 57.0 | 14.0 | 6.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 3 | 6 | 2014-01-01 | 58.0 | 18.0 | 4.0 | 0.0 | 75.0 | 17.0 | 2.0 | 1.0 | 53.0 | 11.0 | 5.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 4 | 8 | 2014-01-01 | 85.0 | 21.0 | 5.0 | 1.0 | 46.0 | 12.0 | 4.0 | 0.0 | 43.0 | 10.0 | 4.0 | 3.0 | 5 | -2 | 22 | 0.0 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 194432 | 122 | 2020-04-26 | 77.0 | 24.0 | 3.0 | 0.0 | 44.0 | 6.0 | 1.0 | 0.0 | 58.0 | 16.0 | 3.0 | 1.0 | 20 | 11 | 92 | 30.9 | 8 |
| 194433 | 123 | 2020-04-26 | 88.0 | 23.0 | 8.0 | 0.0 | 31.0 | 2.0 | 1.0 | 1.0 | 83.0 | 13.0 | 0.0 | 0.0 | 20 | 11 | 92 | 30.9 | 8 |
| 194434 | 125 | 2020-04-26 | 70.0 | 18.0 | 5.0 | 1.0 | 26.0 | 2.0 | 0.0 | 0.0 | 80.0 | 26.0 | 6.0 | 0.0 | 20 | 11 | 92 | 30.9 | 8 |
| 194435 | 126 | 2020-04-26 | 46.0 | 11.0 | 2.0 | 0.0 | 35.0 | 5.0 | 3.0 | 0.0 | 108.0 | 31.0 | 4.0 | 1.0 | 20 | 11 | 92 | 30.9 | 8 |
| 194436 | 76 | 2021-09-24 | 85.0 | 17.0 | 2.0 | 0.0 | 67.0 | 18.0 | 4.0 | 1.0 | 177.0 | 58.0 | 11.0 | 0.0 | 23 | 13 | 0 | 0.0 | 10 |
194437 rows × 19 columns
| | cluster_label | Crime_Date | Sum120 | Sum30 | Sum7 | Sum1 | SumAutoRelated120 | SumAutoRelated30 | SumAutoRelated7 | SumAutoRelated1 | SumOther120 | SumOther30 | SumOther7 | SumOther1 | maxtempC | mintempC | cloudcover | precipMM | visibility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2014-01-01 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 | 5 | -2 | 22 | 0.0 | 10 |
| 1 | 3 | 2014-01-01 | 49.0 | 15.0 | 3.0 | 1.0 | 35.0 | 12.0 | 1.0 | 0.0 | 79.0 | 17.0 | 3.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 2 | 4 | 2014-01-01 | 75.0 | 16.0 | 3.0 | 0.0 | 40.0 | 11.0 | 2.0 | 0.0 | 57.0 | 14.0 | 6.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 3 | 6 | 2014-01-01 | 58.0 | 18.0 | 4.0 | 0.0 | 75.0 | 17.0 | 2.0 | 1.0 | 53.0 | 11.0 | 5.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 4 | 8 | 2014-01-01 | 85.0 | 21.0 | 5.0 | 1.0 | 46.0 | 12.0 | 4.0 | 0.0 | 43.0 | 10.0 | 4.0 | 3.0 | 5 | -2 | 22 | 0.0 | 10 |
| 5 | 9 | 2014-01-01 | 114.0 | 28.0 | 7.0 | 0.0 | 34.0 | 12.0 | 5.0 | 1.0 | 85.0 | 20.0 | 3.0 | 2.0 | 5 | -2 | 22 | 0.0 | 10 |
| 6 | 10 | 2014-01-01 | 85.0 | 18.0 | 1.0 | 1.0 | 57.0 | 15.0 | 2.0 | 0.0 | 101.0 | 21.0 | 6.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 7 | 11 | 2014-01-01 | 82.0 | 19.0 | 6.0 | 0.0 | 46.0 | 11.0 | 2.0 | 0.0 | 70.0 | 23.0 | 2.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 8 | 12 | 2014-01-01 | 68.0 | 17.0 | 3.0 | 0.0 | 42.0 | 19.0 | 2.0 | 0.0 | 62.0 | 11.0 | 3.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 9 | 13 | 2014-01-01 | 121.0 | 40.0 | 9.0 | 3.0 | 42.0 | 13.0 | 5.0 | 1.0 | 70.0 | 17.0 | 4.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 10 | 14 | 2014-01-01 | 59.0 | 22.0 | 6.0 | 1.0 | 57.0 | 14.0 | 5.0 | 2.0 | 70.0 | 11.0 | 1.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 11 | 16 | 2014-01-01 | 41.0 | 14.0 | 7.0 | 2.0 | 59.0 | 13.0 | 4.0 | 2.0 | 54.0 | 13.0 | 0.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 12 | 17 | 2014-01-01 | 85.0 | 22.0 | 4.0 | 1.0 | 50.0 | 12.0 | 2.0 | 0.0 | 87.0 | 20.0 | 2.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 13 | 20 | 2014-01-01 | 66.0 | 16.0 | 3.0 | 0.0 | 28.0 | 9.0 | 1.0 | 0.0 | 79.0 | 19.0 | 5.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 14 | 23 | 2014-01-01 | 100.0 | 25.0 | 4.0 | 1.0 | 48.0 | 15.0 | 3.0 | 0.0 | 92.0 | 19.0 | 6.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 15 | 26 | 2014-01-01 | 92.0 | 30.0 | 13.0 | 1.0 | 23.0 | 3.0 | 0.0 | 0.0 | 59.0 | 13.0 | 1.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 16 | 28 | 2014-01-01 | 103.0 | 32.0 | 11.0 | 1.0 | 29.0 | 6.0 | 0.0 | 0.0 | 50.0 | 8.0 | 0.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 17 | 29 | 2014-01-01 | 61.0 | 14.0 | 1.0 | 0.0 | 37.0 | 10.0 | 4.0 | 0.0 | 50.0 | 17.0 | 8.0 | 2.0 | 5 | -2 | 22 | 0.0 | 10 |
| 18 | 31 | 2014-01-01 | 71.0 | 14.0 | 5.0 | 1.0 | 40.0 | 16.0 | 4.0 | 1.0 | 45.0 | 11.0 | 2.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 19 | 32 | 2014-01-01 | 165.0 | 39.0 | 8.0 | 0.0 | 53.0 | 16.0 | 3.0 | 1.0 | 138.0 | 29.0 | 6.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 20 | 33 | 2014-01-01 | 125.0 | 32.0 | 7.0 | 1.0 | 26.0 | 5.0 | 1.0 | 0.0 | 61.0 | 10.0 | 3.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 21 | 36 | 2014-01-01 | 36.0 | 8.0 | 3.0 | 1.0 | 39.0 | 14.0 | 2.0 | 1.0 | 64.0 | 13.0 | 3.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 22 | 37 | 2014-01-01 | 101.0 | 26.0 | 8.0 | 1.0 | 32.0 | 10.0 | 1.0 | 1.0 | 60.0 | 16.0 | 3.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 23 | 39 | 2014-01-01 | 95.0 | 28.0 | 9.0 | 2.0 | 38.0 | 12.0 | 1.0 | 0.0 | 74.0 | 12.0 | 3.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 24 | 42 | 2014-01-01 | 69.0 | 19.0 | 5.0 | 0.0 | 43.0 | 12.0 | 1.0 | 0.0 | 49.0 | 10.0 | 2.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 25 | 46 | 2014-01-01 | 37.0 | 16.0 | 3.0 | 0.0 | 43.0 | 6.0 | 2.0 | 0.0 | 67.0 | 17.0 | 4.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 26 | 48 | 2014-01-01 | 97.0 | 24.0 | 6.0 | 0.0 | 36.0 | 10.0 | 2.0 | 0.0 | 54.0 | 11.0 | 1.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 27 | 49 | 2014-01-01 | 63.0 | 12.0 | 4.0 | 0.0 | 32.0 | 12.0 | 2.0 | 0.0 | 89.0 | 23.0 | 6.0 | 2.0 | 5 | -2 | 22 | 0.0 | 10 |
| 28 | 50 | 2014-01-01 | 81.0 | 16.0 | 4.0 | 0.0 | 59.0 | 18.0 | 2.0 | 0.0 | 41.0 | 7.0 | 2.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 29 | 51 | 2014-01-01 | 77.0 | 17.0 | 5.0 | 1.0 | 28.0 | 15.0 | 3.0 | 0.0 | 64.0 | 11.0 | 1.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 30 | 53 | 2014-01-01 | 49.0 | 11.0 | 1.0 | 0.0 | 41.0 | 11.0 | 2.0 | 0.0 | 70.0 | 10.0 | 2.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 31 | 54 | 2014-01-01 | 80.0 | 19.0 | 6.0 | 0.0 | 26.0 | 10.0 | 2.0 | 0.0 | 60.0 | 12.0 | 3.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 32 | 55 | 2014-01-01 | 46.0 | 14.0 | 4.0 | 2.0 | 43.0 | 17.0 | 6.0 | 1.0 | 93.0 | 24.0 | 2.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 33 | 56 | 2014-01-01 | 116.0 | 22.0 | 5.0 | 0.0 | 64.0 | 21.0 | 5.0 | 1.0 | 95.0 | 13.0 | 3.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 34 | 57 | 2014-01-01 | 75.0 | 15.0 | 3.0 | 0.0 | 28.0 | 12.0 | 3.0 | 0.0 | 53.0 | 13.0 | 3.0 | 2.0 | 5 | -2 | 22 | 0.0 | 10 |
| 35 | 60 | 2014-01-01 | 105.0 | 24.0 | 3.0 | 0.0 | 28.0 | 5.0 | 2.0 | 1.0 | 44.0 | 11.0 | 4.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 36 | 61 | 2014-01-01 | 88.0 | 25.0 | 5.0 | 0.0 | 37.0 | 9.0 | 0.0 | 0.0 | 60.0 | 13.0 | 2.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 37 | 62 | 2014-01-01 | 88.0 | 22.0 | 7.0 | 4.0 | 32.0 | 9.0 | 2.0 | 0.0 | 60.0 | 12.0 | 5.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 38 | 65 | 2014-01-01 | 127.0 | 28.0 | 5.0 | 0.0 | 43.0 | 16.0 | 4.0 | 0.0 | 70.0 | 15.0 | 6.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 39 | 71 | 2014-01-01 | 68.0 | 17.0 | 3.0 | 0.0 | 38.0 | 15.0 | 3.0 | 0.0 | 54.0 | 8.0 | 2.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 40 | 75 | 2014-01-01 | 102.0 | 31.0 | 7.0 | 2.0 | 25.0 | 4.0 | 0.0 | 0.0 | 56.0 | 9.0 | 2.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 41 | 77 | 2014-01-01 | 90.0 | 20.0 | 7.0 | 2.0 | 67.0 | 17.0 | 2.0 | 1.0 | 174.0 | 38.0 | 3.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 42 | 80 | 2014-01-01 | 70.0 | 12.0 | 3.0 | 0.0 | 19.0 | 6.0 | 3.0 | 0.0 | 58.0 | 17.0 | 3.0 | 2.0 | 5 | -2 | 22 | 0.0 | 10 |
| 43 | 81 | 2014-01-01 | 118.0 | 28.0 | 6.0 | 0.0 | 41.0 | 14.0 | 3.0 | 0.0 | 51.0 | 11.0 | 2.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 44 | 82 | 2014-01-01 | 90.0 | 25.0 | 6.0 | 1.0 | 33.0 | 7.0 | 2.0 | 0.0 | 65.0 | 11.0 | 5.0 | 2.0 | 5 | -2 | 22 | 0.0 | 10 |
| 45 | 87 | 2014-01-01 | 57.0 | 10.0 | 3.0 | 1.0 | 55.0 | 18.0 | 3.0 | 0.0 | 51.0 | 9.0 | 1.0 | 0.0 | 5 | -2 | 22 | 0.0 | 10 |
| 46 | 88 | 2014-01-01 | 71.0 | 18.0 | 3.0 | 1.0 | 44.0 | 17.0 | 3.0 | 0.0 | 64.0 | 14.0 | 5.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 47 | 89 | 2014-01-01 | 60.0 | 20.0 | 5.0 | 0.0 | 31.0 | 8.0 | 0.0 | 0.0 | 61.0 | 11.0 | 2.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 48 | 91 | 2014-01-01 | 68.0 | 21.0 | 7.0 | 0.0 | 41.0 | 8.0 | 0.0 | 0.0 | 45.0 | 11.0 | 1.0 | 1.0 | 5 | -2 | 22 | 0.0 | 10 |
| 49 | 94 | 2014-01-01 | 74.0 | 18.0 | 5.0 | 0.0 | 30.0 | 14.0 | 3.0 | 1.0 | 56.0 | 15.0 | 4.0 | 3.0 | 5 | -2 | 22 | 0.0 | 10 |
<class 'pandas.core.frame.DataFrame'> Int64Index: 194437 entries, 0 to 194436 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cluster_label 194437 non-null int64 1 Crime_Date 194437 non-null object 2 Sum120 194437 non-null float64 3 Sum30 194437 non-null float64 4 Sum7 194437 non-null float64 5 Sum1 194437 non-null float64 6 SumAutoRelated120 194437 non-null float64 7 SumAutoRelated30 194437 non-null float64 8 SumAutoRelated7 194437 non-null float64 9 SumAutoRelated1 194437 non-null float64 10 SumOther120 194437 non-null float64 11 SumOther30 194437 non-null float64 12 SumOther7 194437 non-null float64 13 SumOther1 194437 non-null float64 14 maxtempC 194437 non-null object 15 mintempC 194437 non-null object 16 cloudcover 194437 non-null object 17 precipMM 194437 non-null object 18 visibility 194437 non-null object dtypes: float64(12), int64(1), object(6) memory usage: 29.7+ MB
<class 'pandas.core.frame.DataFrame'> RangeIndex: 194537 entries, 0 to 194536 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cluster_label 194537 non-null int64 1 Crime_Date 194537 non-null object 2 isViolent 194537 non-null int64 dtypes: int64(2), object(1) memory usage: 4.5+ MB
Because we grouped the data and summed the violent crimes, the isViolent column in our Y variable holds the total number of violent crimes for a given cluster and date. To turn it into a binary value of 0 or 1, we apply a condition that gives us our outcome variable.
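One way to express that condition is sketched below, assuming the intended rule is "at least one violent crime" (a count greater than zero). That threshold is an assumption, although it agrees with the later value counts, where the number of zeros is unchanged after binarisation. The data frame here is a made-up example.

```python
import pandas as pd

# Hypothetical per-cluster, per-date counts of violent crimes.
y = pd.DataFrame({
    "cluster_label": [0, 0, 1, 1, 2],
    "Crime_Date": ["2014-01-01", "2014-01-02", "2014-01-01", "2014-01-02", "2014-01-01"],
    "isViolent": [0, 2, 1, 0, 5],
})

# 1 if at least one violent crime was recorded, otherwise 0.
y["isViolent"] = (y["isViolent"] > 0).astype(int)

# Share of rows with at least one violent crime, in percent.
violent_share = 100 * y["isViolent"].mean()

print(y["isViolent"].tolist(), violent_share)
```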
31.944508022247142
<class 'pandas.core.frame.DataFrame'> Int64Index: 150000 entries, 0 to 149999 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Sum120 150000 non-null float64 1 Sum30 150000 non-null float64 2 Sum7 150000 non-null float64 3 Sum1 150000 non-null float64 4 mintempC 150000 non-null object 5 maxtempC 150000 non-null object 6 cloudcover 150000 non-null object 7 precipMM 150000 non-null object 8 visibility 150000 non-null object dtypes: float64(4), object(5) memory usage: 11.4+ MB
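| isViolent | Count |
|---|---|
| 0 | 87682 |
| 1 | 45477 |
| 2 | 12002 |
| 3 | 3180 |
| 4 | 1030 |
| 5 | 352 |
| 6 | 119 |
| 7 | 74 |
| 8 | 36 |
| 9 | 17 |
| 10 | 14 |
| 11 | 6 |
| 16 | 4 |
| 12 | 2 |
| 14 | 1 |
| 15 | 1 |
| 18 | 1 |
| 21 | 1 |
| 49 | 1 |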
We now use a prediction model to predict the occurrence of a violent crime at a given time and location. For this we use a Random Forest Regressor from the sklearn package.
A random forest regressor is a supervised learning model that performs well on problems with non-linear relationships. Each tree in the forest is built independently of the others, and the forest's prediction is the average of the individual trees' predictions.
We feed our independent and dependent variables to the model.
X holds our independent variables and Y is our dependent variable.
We split the dataset into a training set and a test set. The test set is needed to validate the model's predictions on data it has not seen during training.
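The split and fit steps can be sketched as follows. The `n_estimators=50` and `random_state=0` settings match the model shown in the output; everything else is an assumption for illustration, including the synthetic two-column `X`, the synthetic binary `y`, and the `test_size=0.2` split. The `ravel()` call flattens a column-vector target into the 1-D shape scikit-learn expects.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real features and binary target.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([rng.poisson(3, n), rng.normal(15, 10, n)])
y = (X[:, 0] + rng.normal(0, 1, n) > 3).astype(int).reshape(-1, 1)  # column vector

# Hold out part of the data so the model is scored on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train.ravel())  # ravel() gives the expected (n_samples,) shape

# Predictions are averages over trees, so they fall between 0 and 1 here.
pred = model.predict(X_test)
print(pred[:5])
```

Because the target is binary, a classifier such as `RandomForestClassifier` would be a natural alternative to compare against; the regressor is kept here because it is what the text describes.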
We evaluate the performance of the random forest regressor using a few error metrics.
| Value | Count |
|---|---|
| 0 | 87682 |
| 1 | 62318 |
<class 'pandas.core.frame.DataFrame'> Int64Index: 150000 entries, 0 to 149999 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Sum120 150000 non-null float64 1 Sum30 150000 non-null float64 2 Sum7 150000 non-null float64 3 Sum1 150000 non-null float64 4 mintempC 150000 non-null float64 5 maxtempC 150000 non-null float64 6 cloudcover 150000 non-null float64 7 precipMM 150000 non-null float64 8 visibility 150000 non-null float64 dtypes: float64(9) memory usage: 11.4 MB
<class 'pandas.core.frame.DataFrame'> RangeIndex: 150000 entries, 0 to 149999 Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 0 150000 non-null int64 dtypes: int64(1) memory usage: 1.1 MB
<class 'pandas.core.frame.DataFrame'> Int64Index: 122999 entries, 0 to 122998 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Sum120 122999 non-null float64 1 Sum30 122999 non-null float64 2 Sum7 122999 non-null float64 3 Sum1 122999 non-null float64 4 mintempC 122999 non-null float64 5 maxtempC 122999 non-null float64 6 cloudcover 122999 non-null float64 7 precipMM 122999 non-null float64 8 visibility 122999 non-null float64 dtypes: float64(9) memory usage: 9.4 MB
<ipython-input-266-a4d6e07f9187>:8: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
RandomForestRegressor(n_estimators=50, random_state=0)
To measure the accuracy and precision of the forecast/prediction model we use several Key Performance Indicators (KPIs), which give an approximate magnitude of the error or deviation of the predicted values from the actual values. The KPIs we used to measure the accuracy of our model are:
MAPE: Mean Absolute Percentage Error is one of the most commonly used KPIs for measuring prediction accuracy. It is the average of the absolute errors, each expressed as a percentage of the corresponding actual value.
MAE: Mean Absolute Error is, as its name says, the mean of the absolute errors, and is a very good KPI for measuring the precision of a forecast. Unlike MAPE, it is not scaled by the actual values.
RMSE: Root Mean Squared Error is the square root of the average squared error. Like MAE, it is not scaled by the actual values.
Mean Absolute Error: 0.49053864407138353
Mean Squared Error: 0.261238001879518
Root Mean Squared Error: 0.511114470426653
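The definitions above can be written out in a few lines of NumPy. This is a minimal sketch with made-up values; the function name `regression_metrics` is hypothetical. MAPE is computed only over rows whose actual value is non-zero, since the percentage error is undefined when the actual value is 0, which happens often with a binary target.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and MAPE (in percent) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    mae = np.mean(np.abs(err))   # mean of absolute errors
    mse = np.mean(err ** 2)      # mean of squared errors
    rmse = np.sqrt(mse)          # square root of the MSE

    # Percentage errors are only defined where the actual value is non-zero.
    nonzero = y_true != 0
    if nonzero.any():
        mape = 100 * np.mean(np.abs(err[nonzero] / y_true[nonzero]))
    else:
        mape = float("nan")

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}

# Made-up actual values and predictions.
metrics = regression_metrics([0, 1, 1, 0], [0.2, 0.6, 0.9, 0.4])
print(metrics)
```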
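Lower MAE, MSE and RMSE values indicate better predictions, so improving the model should bring all three down. The accuracy figure reported below is the RMSE multiplied by 100 (0.5111 × 100), not a classification accuracy.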
Accuracy for Random Forest 51.111447042665304
Observations :
This project aims to identify important features for predicting crime in the Baltimore area, to help the police force plan and allocate resources to tackle crime.
Based on our analysis, we find that weather and time are factors that have an impact on crime occurrence: